Where can you eat safely in Chigago : an analysis based on the inspections reports from the Chicago department of Public Health's Food Protection Program.

1. Introduction

The Chicago department of Public Health’s Food Protection Program provides a database which contains the information from inspection reports of restaurants and other food establishments in Chicago from 2010 to the present. It contains many informations about the establishments, like their type of facility (grocery stores, restaurants, coffee shop, …) and their locations. Many informations about the violations listed are also provided in the database, like the description of the findings that caused them and the reason that invoked the staff from the program to led an inspection.

The main research question is : Where can you eat safely in Chicago ?

The adjective safely is chosen wisely because it has multiple meanings : you can eat in a safe way making sure that the establishment where you go respects particular hygiene rules, but also that the place is safe according to the crime rate of its district.

To begin, it has been decided that the analysis of the data will be done temporally. Indeed, the field of restoration is known to be in a constant evolution : each year, many establishments are opening while other are closing or changing of owner - especially in big cities. This is why the different calculations and comparisons will be done by year. To answer the main question about where to eat in Chicago, the answer will be given based on the latest data. Also, a map visualization seems to be a good solution to diplay the results, as the analysis is spatio-temporal.

Hygiene Score

The hygiene of the public and private food establishments listed in the dataset will be calculated using on a hygiene score based on the different violations reported and the final result of the inspection. In order to compare the places to eat by delimited zones, a mean will be computed by community area.

Questions :

Are their significant differences of hygiene scores between the community areas ?

How is the evolution over time ?

A little study about Management

Two datasets have been found, which contain the owner's et license's informations of Chicago establishments. It would be interesting to merge it with the first dataset in order to study the trends regarding the management of the establishments.

Questions :

Is their a relation between the number of establishments that an owner has and the hygiene scores obtained ?

In which way a change of owner changes the hygiene score of an establishment ?

Crime Score

The Chicago Crime Report can be used in order to compute a crime score for each community area, following the same principle than in the first point. This way, it will be possible to compare the crime and hygiene scores according to the community areas.

Questions :

Criminology domain

Is their a relation between the crime score and the hygiene scores obtained in the Comm Area ?

Limits :

An important point is to pay intention to the number of inspected establishments compared to the total number of establishments. It is certain than the variations of this ratio between the community areas has an impact on the results. An explanation of the variations should be purposed. In order to give a complete answer to the main research question, the uninspected establishments have to be taken into account.

Definitions :

Hygiene Rules : Food Code Rules

Community Area : Chicago Community Area separation (availible on the site of Chicago city)

The purpose of the project is to help the consumer to easily choose where to eat in Chicago and to provide them an interactive and intuitive way to browse the different places offered to them. Also, it could help the Chicago department of Public Health’s Food Protection Program to adapt their methods relying on the situation described by the findings of the analysis (for example, if a prevention program should be proposed for a specific area or type of facility).


In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import re
from matplotlib.ticker import MaxNLocator
import matplotlib.pyplot as plt
import requests as req
from bs4 import BeautifulSoup
import seaborn as sns
from shapely.geometry import Point
from shapely.geometry.polygon import Polygon
import folium
import os
import json
from folium.plugins import MarkerCluster

2. Preprocessing

First a quick look at how the dataset is organized.

In [2]:
eat_seat = pd.read_csv('food-inspections.csv',sep=',') #creation of the dataframe
eat_seat.head(3)
Out[2]:
Inspection ID DBA Name AKA Name License # Facility Type Risk Address City State Zip ... Results Violations Latitude Longitude Location Historical Wards 2003-2015 Zip Codes Community Areas Census Tracts Wards
0 2352683 CHIPOTLE MEXICAN GRILL CHIPOTLE MEXICAN GRILL 2670642.0 Restaurant Risk 1 (High) 1025 W ADDISON ST CHICAGO IL 60613.0 ... Pass w/ Conditions 3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E... 41.947118 -87.655381 {'latitude': '-87.6553811622124', 'longitude':... NaN NaN NaN NaN NaN
1 2352734 CHILI'S T-I CHILI'S (T1-B14) 34169.0 Restaurant Risk 1 (High) 11601 W TOUHY AVE CHICAGO IL 60666.0 ... Pass 10. ADEQUATE HANDWASHING SINKS PROPERLY SUPPLI... 42.008536 -87.914428 {'latitude': '-87.91442843927047', 'longitude'... NaN NaN NaN NaN NaN
2 2352713 NICK'S FIRE GRILL STEAK & LEMONADE INC. NICK'S FIRE GRILL STEAK & LEMONADE INC. 2699090.0 NaN Risk 1 (High) 2900 W 63RD ST CHICAGO IL 60629.0 ... Not Ready NaN 41.779190 -87.695892 {'latitude': '-87.69589237398256', 'longitude'... NaN NaN NaN NaN NaN

3 rows × 22 columns

In [3]:
eat_seat.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196825 entries, 0 to 196824
Data columns (total 22 columns):
Inspection ID                 196825 non-null int64
DBA Name                      196825 non-null object
AKA Name                      194367 non-null object
License #                     196808 non-null float64
Facility Type                 192057 non-null object
Risk                          196756 non-null object
Address                       196825 non-null object
City                          196686 non-null object
State                         196783 non-null object
Zip                           196775 non-null float64
Inspection Date               196825 non-null object
Inspection Type               196824 non-null object
Results                       196825 non-null object
Violations                    144559 non-null object
Latitude                      196135 non-null float64
Longitude                     196135 non-null float64
Location                      196135 non-null object
Historical Wards 2003-2015    0 non-null float64
Zip Codes                     0 non-null float64
Community Areas               0 non-null float64
Census Tracts                 0 non-null float64
Wards                         0 non-null float64
dtypes: float64(9), int64(1), object(12)
memory usage: 33.0+ MB

The dataset contains :

  • 195'312 entries
  • 22 columns (listed above)

2.1 Columns Cleaning

Duplicates management

To begin, it is primary to check if there are duplicates and, if so, decide how to handle them.

In [4]:
duplicates = eat_seat[eat_seat.duplicated(subset=None, keep=False)]
len(duplicates)
Out[4]:
298

It is not realistic that two rows contain the exact same cells : two inspections cannot happen at the same time, the explanation is that one is the duplicate of the other. Considering this postulate, we can drop the duplicates.

In [5]:
eat_seat = eat_seat.drop_duplicates()
In [6]:
len(eat_seat)
Out[6]:
196676

Missing and useless values

Now, we want to see which values are missing.

In [7]:
eat_seat.isnull().sum()
Out[7]:
Inspection ID                      0
DBA Name                           0
AKA Name                        2457
License #                         17
Facility Type                   4768
Risk                              69
Address                            0
City                             139
State                             42
Zip                               50
Inspection Date                    0
Inspection Type                    1
Results                            0
Violations                     52234
Latitude                         689
Longitude                        689
Location                         689
Historical Wards 2003-2015    196676
Zip Codes                     196676
Community Areas               196676
Census Tracts                 196676
Wards                         196676
dtype: int64
In [8]:
preview = pd.DataFrame(eat_seat.isnull().sum()).plot.barh(title = 'Missing Values by Columns')

As we can see, the columns Historical Wards 2003-2015, Zip Codes, Community Areas, Census Tracts and Wards are empty and will be dropped.

We will only be using the DBA Name (the name under which the establishment is doing business ; DBA = doing business as), so we drop the AKA Name column too.

Also, we will use the Latitude and Longitude in our analysis, so we can drop de Location, Zip, City and State columns too. Address column is kept because it can help us to notice some inconsistencies.

In [9]:
eat_seat = eat_seat.drop(columns = ['AKA Name','Historical Wards 2003-2015', 'Zip Codes', 'Community Areas',\
                                    'Census Tracts', 'Location', 'Wards', 'City', 'State', 'Zip'])
eat_seat.head(2)
Out[9]:
Inspection ID DBA Name License # Facility Type Risk Address Inspection Date Inspection Type Results Violations Latitude Longitude
0 2352683 CHIPOTLE MEXICAN GRILL 2670642.0 Restaurant Risk 1 (High) 1025 W ADDISON ST 2019-12-04T00:00:00.000 License Re-Inspection Pass w/ Conditions 3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E... 41.947118 -87.655381
1 2352734 CHILI'S T-I 34169.0 Restaurant Risk 1 (High) 11601 W TOUHY AVE 2019-12-04T00:00:00.000 Canvass Pass 10. ADEQUATE HANDWASHING SINKS PROPERLY SUPPLI... 42.008536 -87.914428

Type Casting

In order to standardize the data, the numbers in the License column are converted into integer.

In [10]:
eat_seat['License #'] = pd.to_numeric(eat_seat['License #'], downcast= 'integer')
In [11]:
def nbestethic(nb) :
    try :
        return int(nb)
    except :
        print(nb)
        return nb
In [12]:
eat_seat['License #'] = eat_seat['License #'].apply(nbestethic)
eat_seat.head(2)
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
Out[12]:
Inspection ID DBA Name License # Facility Type Risk Address Inspection Date Inspection Type Results Violations Latitude Longitude
0 2352683 CHIPOTLE MEXICAN GRILL 2670642.0 Restaurant Risk 1 (High) 1025 W ADDISON ST 2019-12-04T00:00:00.000 License Re-Inspection Pass w/ Conditions 3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E... 41.947118 -87.655381
1 2352734 CHILI'S T-I 34169.0 Restaurant Risk 1 (High) 11601 W TOUHY AVE 2019-12-04T00:00:00.000 Canvass Pass 10. ADEQUATE HANDWASHING SINKS PROPERLY SUPPLI... 42.008536 -87.914428

Inspection Date and Risk

There are also adjustments to make in the following columns, because the formats can be optimized :

  • In Inspection Date, only the day will be kept, not the time of day that is actually not given
  • In Risk, only the number will remain
In [13]:
eat_seat['Inspection Date'] = eat_seat['Inspection Date'].apply(lambda x:x.split('T')[0])
In [14]:
from datetime import datetime, date, time
In [15]:
eat_seat['Inspection Date'] = eat_seat['Inspection Date'].apply(lambda x:datetime.strptime(x, '%Y-%m-%d'))
In [16]:
def inspection_per_year(dblp):
    return dblp[['Inspection Date', 'Inspection ID']].groupby('Inspection Date').count()

ax = inspection_per_year(eat_seat).plot(rot=-45)
ax.set_xlabel('Year')
ax.set_ylabel('#Inspections')
Out[16]:
Text(0, 0.5, '#Inspections')
In [17]:
len(eat_seat[eat_seat['Inspection Date'].isnull()])
Out[17]:
0

We see in this graph that all the Inspection Date are between 2010 and 2019, which makes sense. Also, there is no missing inspection date. This column is now cleaned.

For the Risk column, we first want to check what types of risk are listed.

In [18]:
eat_seat.Risk.unique()
Out[18]:
array(['Risk 1 (High)', 'Risk 3 (Low)', 'Risk 2 (Medium)', 'All', nan],
      dtype=object)

We will replace All and High Risk by 3, Medium Risk by 2 and Low Risk by 1.

In [19]:
eat_seat['Risk'] = eat_seat['Risk'].replace({'All':3, 'Risk 1 (High)':3, 'Risk 2 (Medium)':2, 'Risk 3 (Low)':1})
eat_seat['Risk'] = pd.to_numeric(eat_seat['Risk'], downcast= 'integer')
eat_seat.head(3)
Out[19]:
Inspection ID DBA Name License # Facility Type Risk Address Inspection Date Inspection Type Results Violations Latitude Longitude
0 2352683 CHIPOTLE MEXICAN GRILL 2670642.0 Restaurant 3.0 1025 W ADDISON ST 2019-12-04 License Re-Inspection Pass w/ Conditions 3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E... 41.947118 -87.655381
1 2352734 CHILI'S T-I 34169.0 Restaurant 3.0 11601 W TOUHY AVE 2019-12-04 Canvass Pass 10. ADEQUATE HANDWASHING SINKS PROPERLY SUPPLI... 42.008536 -87.914428
2 2352713 NICK'S FIRE GRILL STEAK & LEMONADE INC. 2699090.0 NaN 3.0 2900 W 63RD ST 2019-12-04 License Not Ready NaN 41.779190 -87.695892
In [20]:
def inspection_per_risk(dblp):
    return dblp[['Risk', 'Inspection ID']].groupby('Risk').count()


ax = inspection_per_risk(eat_seat).plot(kind="barh")
ax.set_ylabel('Risk')
ax.set_xlabel('#Inspections')
Out[20]:
Text(0.5, 0, '#Inspections')
In [21]:
len(eat_seat[eat_seat['Risk'].isnull()])
Out[21]:
69

We see in this graph that all the Risk are between 1 and 3. There is, for now, 73 missing values, but as it is not a problem for now, we will just keep it in mind.

2.2 Facilities of interest Selection

Then, we want to put the different facility types in categories that make sense for our project.

First, we thought about only select the "private" establishments, where it is possible to eat a main course (for example, the places where you can only eat an ice cream are deleted of our list). They all are categorized in order to be compared with each other.

In [23]:
public_dic = {'restaurant' : ['Restaurant', 'DINING HALL', 'TENT RSTAURANT'], \
              'grocery_restaurant' : ['Grocery & Restaurant', 'GROCERY& RESTAURANT', 'GROCERY/RESTAURANT',\
                                    'GROCERY/ RESTAURANT', 'GROCERY STORE/ RESTAURANT', 'GROCERY & RESTAURANT',\
                                    'RESTAURANT/GROCERY', 'grocery & restaurant', 'RESTAURANT/GROCERY STORE',\
                                    'GROCERY/TAQUERIA', 'GAS STATION/RESTAURANT'],\
              'banquet' : ['LOUNGE/BANQUET HALL', 'BANQUET', 'Banquet Hall', 'BANQUET FACILITY', 'banquet hall',\
                         'banquets', 'Banquet Dining',  'Banquet/kitchen','RESTAURANT.BANQUET HALLS',\
                         'BANQUET HALL', 'Banquet', 'BOWLING LANES/BANQUETS'], \
              'rooftop_restaurant' : ['Wrigley Roof Top', 'REST/ROOFTOP'],\
              'bar_restaurant' : ['RESTAURANT/BAR', 'RESTUARANT AND BAR', 'BAR/GRILL', 'RESTAURANT/BAR/THEATER',\
                                'JUICE AND SALAD BAR', 'SUSHI COUNTER', 'TAVERN/RESTAURANT', 'tavern/restaurant',\
                                'TAVERN GRILL'], \
              'bakery_restaurant' : ['BAKERY/ RESTAURANT', 'bakery/restaurant', 'RESTAURANT/BAKERY'], \
              'liquor_restaurant' : ['RESTAURANT AND LIQUOR', 'RESTAURANT/LIQUOR'], \
              'catering' : ['CATERING/CAFE', 'Catering'], \
              'golden_diner' : ['Golden Diner']}
In [24]:
facilitytype = 'BANQUET'
len(eat_seat[eat_seat['Facility Type'] == facilitytype])
Out[24]:
64

This command returns the number of occurencs of the Facility Type inputed.

With trying different types previously categorized and listed in the public_dic dictionary we have noted that the results were too distant to conduct a meaningful analysis. That's why we then decided to also select "public" establishments like school cafeterias and hospitals. It could be interesting to compare private and public inspection results.

In [25]:
private_dic = {'daycare' : ['Daycare Above and Under 2 Years', 'Daycare (2 - 6 Years)', 'Daycare Combo 1586',\
                          'Daycare (Under 2 Years)', 'DAYCARE 2 YRS TO 12 YRS', 'Daycare Night', 'DAY CARE 2-14',\
                          'Daycare (2 Years)', 'DAYCARE', 'ADULT DAYCARE', '15 monts to 5 years old', 'youth housing',\
                          'DAYCARE 1586', 'DAYCARE COMBO', '1584-DAY CARE ABOVE 2 YEARS', 'CHURCH/DAY CARE', 'DAY CARE',\
                          'DAYCARE 6 WKS-5YRS', 'DAY CARE 1023', 'DAYCARE 2-6, UNDER 6', 'Day Care Combo (1586)'], \
               'school' : ['SCHOOL', 'School', 'PRIVATE SCHOOL', 'AFTER SCHOOL PROGRAM', 'COLLEGE',\
                         'BEFORE AND AFTER SCHOOL PROGRAM', 'Private School', 'TEACHING SCHOOL',\
                         'PUBLIC SHCOOL', 'CHARTER SCHOOL CAFETERIA', 'CAFETERIA', 'Cafeteria', 'cafeteria',\
                         'UNIVERSITY CAFETERIA', 'PREP INSIDE SCHOOL', 'CHARTER SCHOOL', 'school cafeteria',\
                         'CHARTER SCHOOL/CAFETERIA', 'School Cafeteria', 'ALTERNATIVE SCHOOL', 'CITY OF CHICAGO COLLEGE',\
                         'after school program', 'CHURCH/AFTER SCHOOL PROGRAM', 'AFTER SCHOOL CARE'], \
               'childrens_services' : ["Children's Services Facility", 'CHILDRENS SERVICES FACILITY', \
                                     "CHILDERN'S SERVICE FACILITY", "1023 CHILDREN'S SERVICES FACILITY", \
                                     "1023 CHILDERN'S SERVICES FACILITY", "1023-CHILDREN'S SERVICES FACILITY", \
                                     "1023 CHILDERN'S SERVICE FACILITY", "1023 CHILDERN'S SERVICE S FACILITY", \
                                     'CHILDERN ACTIVITY FACILITY', "CHILDERN'S SERVICES  FACILITY", '1023'], \
               'adultcare' : ['Long Term Care', 'REHAB CENTER', 'Hospital', 'ASSISTED LIVING', 'SENIOR DAY CARE',\
                            'Assisted Living', 'NURSING HOME', 'ASSISTED LIVING FACILITY', 'SUPPORTIVE LIVING FACILITY',\
                            'Assisted Living Senior Care', 'Adult Family Care Center', '1005 NURSING HOME', \
                            'Long-Term Care Facility', 'LONG TERM CARE FACILITY', 'ASSISSTED LIVING',\
                            'Long-Term Care','Long Term Care Facility', 'VFW HALL']}
In [26]:
total_dic = {**public_dic , **private_dic}
In [27]:
#inverting the dict
facilities = {}
for key in total_dic :
    for facility in total_dic[key] :
        facilities[facility] = key
In [28]:
facilitygroups = pd.DataFrame(data = facilities.values(), index=facilities.keys(), columns = ['FacilityGroup'])
facilitygroups.head(3)
Out[28]:
FacilityGroup
Restaurant restaurant
DINING HALL restaurant
TENT RSTAURANT restaurant

Creating the facilitygroups Dataframe, we'll next merge it with the eat_seat DataFrame

In [29]:
facilitygroups.index.name = 'Facility Type'

Changing the index name to merge the two DataFrame on it

In [30]:
eat_seat = pd.merge(eat_seat, facilitygroups, on = ['Facility Type'])
eat_seat.head(3)
Out[30]:
Inspection ID DBA Name License # Facility Type Risk Address Inspection Date Inspection Type Results Violations Latitude Longitude FacilityGroup
0 2352683 CHIPOTLE MEXICAN GRILL 2670642.0 Restaurant 3.0 1025 W ADDISON ST 2019-12-04 License Re-Inspection Pass w/ Conditions 3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E... 41.947118 -87.655381 restaurant
1 2352734 CHILI'S T-I 34169.0 Restaurant 3.0 11601 W TOUHY AVE 2019-12-04 Canvass Pass 10. ADEQUATE HANDWASHING SINKS PROPERLY SUPPLI... 42.008536 -87.914428 restaurant
2 2352701 TAQUERIA BLUE LINE INC. 2703436.0 Restaurant 3.0 3401 W BELMONT AVE 2019-12-04 License Pass w/ Conditions 3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E... 41.939168 -87.712546 restaurant

Merging the 2 DataFrame on the Facility Type Column

Give us the FacilityGroup column

In [31]:
def inspection_per_facility(dblp):
    return dblp[['FacilityGroup', 'Inspection ID']].groupby('FacilityGroup').count()


ax = inspection_per_facility(eat_seat).plot(kind="barh")
ax.set_ylabel('Facility Type')
ax.set_xlabel('#Inspections')
Out[31]:
Text(0.5, 0, '#Inspections')

2.3 Public or Private ?

In [32]:
public_private = {'daycare' : 'Private', 'school' : 'Private', 'children-services' : 'Private', 'adultcare' : 'Private',\
                  'restaurant' : 'Public', 'grocery_restaurant' : 'Public', 'banquet' : 'Public', 'rooftop_restaurant' : 'Public',\
                'bar_restaurant' : 'Public', 'bakery_restaurant' : 'Public', 'liquor_restaurant' : 'Public' , 'catering' : 'Public', 'golden_diner' : 'Public'}

This public_private dictionnary is filled with the global type (public vs private) of each type of facility

In [33]:
public_private_df = pd.DataFrame(data = public_private.values(), index=public_private.keys(), columns = ['Public_Private'])
public_private_df.head(5)
Out[33]:
Public_Private
daycare Private
school Private
children-services Private
adultcare Private
restaurant Public

Creating the public_private_df to next merge it with the eat_seat DataFrame

In [34]:
public_private_df.index.name = 'FacilityGroup'

Renaming the index name to merge on it

In [35]:
eat_seat = pd.merge(eat_seat, public_private_df, on = ['FacilityGroup'])
eat_seat.head(3)
Out[35]:
Inspection ID DBA Name License # Facility Type Risk Address Inspection Date Inspection Type Results Violations Latitude Longitude FacilityGroup Public_Private
0 2352683 CHIPOTLE MEXICAN GRILL 2670642.0 Restaurant 3.0 1025 W ADDISON ST 2019-12-04 License Re-Inspection Pass w/ Conditions 3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E... 41.947118 -87.655381 restaurant Public
1 2352734 CHILI'S T-I 34169.0 Restaurant 3.0 11601 W TOUHY AVE 2019-12-04 Canvass Pass 10. ADEQUATE HANDWASHING SINKS PROPERLY SUPPLI... 42.008536 -87.914428 restaurant Public
2 2352701 TAQUERIA BLUE LINE INC. 2703436.0 Restaurant 3.0 3401 W BELMONT AVE 2019-12-04 License Pass w/ Conditions 3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E... 41.939168 -87.712546 restaurant Public

Merging the public_private_df and the eat_seat on FacilityGroup to create the Public_Private column

2.4 Non-defined licenses numbers

There are also missing license numbers, corresponding to the nan printed in the output of the subsection "Type casting". We see in the graph below that there is some inspection numbers corresponding to 0, which is not a valid license number.

In [36]:
def inspection_per_license(dblp):
    return dblp[['License #', 'Inspection ID']].groupby('License #').count()

ax = inspection_per_license(eat_seat).plot(rot=-45)
ax.set_xlabel('License #')
ax.set_ylabel('#Inspections')
Out[36]:
Text(0, 0.5, '#Inspections')

The following dataframe displays the establishments where the license number is missing or is corresponding to 0.0.

In [37]:
missing = eat_seat[eat_seat['License #'].isnull() | eat_seat['License #'] == 0.0]
missing.head()
Out[37]:
Inspection ID DBA Name License # Facility Type Risk Address Inspection Date Inspection Type Results Violations Latitude Longitude FacilityGroup Public_Private
3931 2303443 SUBWAY SANDWICHES 0.0 Restaurant 3.0 3951 N KIMBALL AVE 2019-07-22 Canvass Fail 2. CITY OF CHICAGO FOOD SERVICE SANITATION CER... 41.953544 -87.712743 restaurant Public
4723 2300450 SUBWAY 0.0 Restaurant 3.0 4771 N LINCOLN AVE 2019-06-28 Canvass Out of Business NaN 41.968506 -87.688338 restaurant Public
5096 2293900 SUBWAY 0.0 Restaurant 3.0 4771 N LINCOLN AVE 2019-06-20 Canvass Re-Inspection Pass w/ Conditions 3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E... 41.968506 -87.688338 restaurant Public
5430 2293545 SUBWAY 0.0 Restaurant 3.0 4771 N LINCOLN AVE 2019-06-13 Canvass Fail 1. PERSON IN CHARGE PRESENT, DEMONSTRATES KNOW... 41.968506 -87.688338 restaurant Public
8457 2282242 ALTHEA BY MKC 0.0 Restaurant 3.0 700 N MICHIGAN AVE 2019-03-28 Canvass Out of Business NaN 41.894980 -87.624400 restaurant Public
In [38]:
len(missing)
Out[38]:
172

In order to resolve this problem, we search if there are other establishments with the same DBA Name so that, if so, we can attribute their license number to the non-defined license ones.

In [39]:
missing_found = eat_seat[eat_seat['DBA Name'].isin(missing['DBA Name']) & (eat_seat['License #'].notnull() & eat_seat['License #'] != 0.0)]
missing_found.head()
Out[39]:
Inspection ID DBA Name License # Facility Type Risk Address Inspection Date Inspection Type Results Violations Latitude Longitude FacilityGroup Public_Private
10 2352617 SUBWAY 2703766.0 Restaurant 3.0 615 W LAKE ST 2019-12-03 License Pass w/ Conditions 3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E... 41.885594 -87.643462 restaurant Public
259 2345950 BRITISH AIRWAYS 2517808.0 Restaurant 3.0 11601 W TOUHY AVE 2019-11-19 Canvass Pass w/ Conditions 22. PROPER COLD HOLDING TEMPERATURES - Comment... 42.008536 -87.914428 restaurant Public
501 2345318 SUBWAY 2529116.0 Restaurant 3.0 2620 N NARRAGANSETT AVE 2019-11-08 Canvass Re-Inspection Pass w/ Conditions 3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E... 41.927995 -87.785752 restaurant Public
607 2321166 SUBWAY 2529116.0 Restaurant 3.0 2620 N NARRAGANSETT AVE 2019-11-06 Canvass Fail 3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E... 41.927995 -87.785752 restaurant Public
760 2320750 SUBWAY 2522465.0 Restaurant 3.0 6352 N Milwaukee AVE 2019-10-30 Canvass Pass 39. CONTAMINATION PREVENTED DURING FOOD PREPAR... 41.996447 -87.787207 restaurant Public
In [40]:
len(missing_found)
Out[40]:
3415

We see that there is 3396 entries in the database having a License number and corresponding to an esthablishment who has at some point a missing License number. We will search in those entries the ones corresponding to the DBA name, Latitude and Longitude of the ones having a missing License number, take the latest entry (the latest inspection), and infer that the missing License number can be replaced by the one of this entry.

In [41]:
missing_found = missing_found.drop(columns = ['Facility Type', 'Risk', 'Address', 'Inspection Type', 'Results', 'Violations', 'FacilityGroup'])
In [42]:
latest = pd.DataFrame(missing_found.groupby(['DBA Name', 'Latitude', 'Longitude'])['Inspection Date'].max())
latest.head()
Out[42]:
Inspection Date
DBA Name Latitude Longitude
ALTHEA BY MKC 41.894980 -87.624400 2019-05-30
ARAMARK 41.653637 -87.605071 2019-01-29
41.785791 -87.603745 2016-08-22
41.788311 -87.604770 2018-10-31
41.791261 -87.605085 2016-10-27
In [43]:
len(latest)
Out[43]:
319
In [44]:
#DataFrame.merge(self, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
latest_v2 = pd.merge(latest, missing_found, how='inner', on=['Inspection Date', 'DBA Name', 'Latitude', 'Longitude'])
latest_v2.head()
Out[44]:
Inspection Date DBA Name Latitude Longitude Inspection ID License # Public_Private
0 2019-05-30 ALTHEA BY MKC 41.894980 -87.624400 2290711 2659160.0 Public
1 2019-05-30 ALTHEA BY MKC 41.894980 -87.624400 2290710 2659161.0 Public
2 2019-05-30 ALTHEA BY MKC 41.894980 -87.624400 2290702 2646659.0 Public
3 2019-01-29 ARAMARK 41.653637 -87.605071 2261263 3759928.0 Private
4 2016-08-22 ARAMARK 41.785791 -87.603745 1950902 2162462.0 Public

In latest we have an entry per unique couple "DBA Name + Latitude + Longitude" corresponding to the latest inspection date found for this couple in missing_found (reminder : in missing_found, there is each entry of eat_seat that has the name of an establishment where at some point the License number is missing). In latest_v2 we keep every entry corresponding to those "DBA Name + Latitude + Longitude + latest inspection date" that appear in missing_found, in order to investigate the potential duplicates.

In [45]:
if len(latest) < len(latest_v2) :
    print('Duplicates detected !')
else : 
    print('No Duplicates !')
Duplicates detected !

We see that there is duplicates of entries because the length of latest_v2 is greater than the length of latest. We investigate that.

In [46]:
#DataFrame.duplicated(self, subset=None, keep='first')
latest_duplicated = latest_v2[latest_v2.duplicated(subset = ['Inspection Date', 'DBA Name', 'Latitude', 'Longitude'], keep=False)]
latest_duplicated
Out[46]:
Inspection Date DBA Name Latitude Longitude Inspection ID License # Public_Private
0 2019-05-30 ALTHEA BY MKC 41.894980 -87.624400 2290711 2659160.0 Public
1 2019-05-30 ALTHEA BY MKC 41.894980 -87.624400 2290710 2659161.0 Public
2 2019-05-30 ALTHEA BY MKC 41.894980 -87.624400 2290702 2646659.0 Public
23 2019-03-21 CHARTWELLS 41.743152 -87.587883 2279746 2549069.0 Private
24 2019-03-21 CHARTWELLS 41.743152 -87.587883 2279761 2549069.0 Private
60 2018-11-05 LA MICHOACANA 41.808399 -87.674814 2232769 2476172.0 Public
61 2018-11-05 LA MICHOACANA 41.808399 -87.674814 2232771 2631675.0 Public
94 2019-08-06 SUBWAY 41.707374 -87.617030 2304209 2683476.0 Public
95 2019-08-06 SUBWAY 41.707374 -87.617030 2304217 1804587.0 Public
110 2012-03-15 SUBWAY 41.750750 -87.629792 1214209 1947565.0 Public
111 2012-03-15 SUBWAY 41.750750 -87.629792 1214210 2036493.0 Public
177 2015-12-07 SUBWAY 41.878003 -87.634299 1607215 1516580.0 Public
178 2015-12-07 SUBWAY 41.878003 -87.634299 1591320 1516580.0 Public
192 2019-01-29 SUBWAY 41.884624 -87.620476 2261295 70217.0 Public
193 2019-01-29 SUBWAY 41.884624 -87.620476 2261270 70217.0 Public
194 2019-01-29 SUBWAY 41.884624 -87.620476 2261283 2646445.0 Public
320 2014-03-13 WENDY'S 41.692476 -87.622842 1401915 1145749.0 Public
321 2014-03-13 WENDY'S 41.692476 -87.622842 1401916 1145749.0 Public

For Chartwells, Subway "41.878003 - 87.634299" and Wendy's those duplicates are not a problem because the License number is the same for each duplicate. But for the other 5 duplicates the License number varies and there is no obvious way to decide which license number to keep; those entries will then be dropped. For the others, the license number hereby found will be used to complete the primary database.

In [47]:
latest_duplicated = latest_duplicated.drop(columns = ['Inspection Date', 'Inspection ID'])
In [48]:
tuples = latest_duplicated.drop_duplicates(subset = ['DBA Name', 'Latitude', 'Longitude', 'License #'], keep = False)
In [49]:
tuples = tuples.drop_duplicates(subset = ['DBA Name', 'Latitude', 'Longitude'], keep = 'first')
In [50]:
tuples = tuples.set_index(['DBA Name', 'Latitude', 'Longitude'])
In [51]:
dup = latest_v2[latest_v2.duplicated(subset = ['DBA Name', 'Latitude', 'Longitude', 'License #'], keep='first')]
In [52]:
dup = dup.set_index(['DBA Name', 'Latitude', 'Longitude'])
In [53]:
unique = dup[~dup.index.isin(tuples.index)]
unique
Out[53]:
Inspection Date Inspection ID License # Public_Private
DBA Name Latitude Longitude
CHARTWELLS 41.743152 -87.587883 2019-03-21 2279761 2549069.0 Private
SUBWAY 41.878003 -87.634299 2015-12-07 1591320 1516580.0 Public
WENDY'S 41.692476 -87.622842 2014-03-13 1401916 1145749.0 Public
In [55]:
latest_v2 = latest_v2.set_index(['DBA Name', 'Latitude', 'Longitude'])
In [56]:
latest_duplicated = latest_duplicated.set_index(['DBA Name', 'Latitude', 'Longitude'])
In [57]:
latest_v2 = latest_v2[~latest_v2.index.isin(latest_duplicated.index)]
In [58]:
to_fill = latest_v2.append(unique)
In [59]:
to_fill.head()
Out[59]:
Inspection Date Inspection ID License # Public_Private
DBA Name Latitude Longitude
ARAMARK 41.653637 -87.605071 2019-01-29 2261263 3759928.0 Private
41.785791 -87.603745 2016-08-22 1950902 2162462.0 Public
41.788311 -87.604770 2018-10-31 2232591 2374384.0 Public
41.791261 -87.605085 2016-10-27 1970345 2374382.0 Public
41.866310 -87.610270 2014-05-30 1453362 2253161.0 Public

Now for every DBA Name, Latitude and Longitude having a missing License number in eat_seat but existing in to_fill, we can replace the missing License number in eat_seat by the one in to_fill. First we replace every Null value by 0.0 as we infered it was the same in the database.

In [60]:
eat_seat['License #'].fillna(0.0, inplace = True)
In [61]:
for i, row in eat_seat.iterrows() :
    if (bool(float(row['License #']) == 0.0)) & ((row['DBA Name'], row['Latitude'], row['Longitude']) in to_fill.index) :
        eat_seat.loc[i, 'License #'] = float(to_fill.loc[(row['DBA Name'], row['Latitude'], row['Longitude']),['License #']]) #replace by the one in to_fill
In [62]:
len(eat_seat[eat_seat['License #'] == 0.0])
Out[62]:
129

We now delete the rows that still have a missing number, corresponding to those we were not able to infer.

In [63]:
missing = eat_seat[eat_seat['License #'] == 0.0]
In [64]:
eat_seat = eat_seat.drop(missing.index)
In [65]:
eat_seat = eat_seat.rename(columns={"License #": "License"}) #rename the column 'License #' into 'License'
In [66]:
len(eat_seat.License.unique())
Out[66]:
23371
In [67]:
def inspection_per_license(dblp):
    return dblp[['License', 'Inspection ID']].groupby('License').count()

ax = inspection_per_license(eat_seat).plot(rot=-45)
ax.set_xlabel('License #')
ax.set_ylabel('#Inspections')
Out[67]:
Text(0, 0.5, '#Inspections')

2.5 Community Areas

In [68]:
eat_seat.isnull().sum()
Out[68]:
Inspection ID          0
DBA Name               0
License                0
Facility Type          0
Risk                   2
Address                0
Inspection Date        0
Inspection Type        1
Results                0
Violations         36354
Latitude             466
Longitude            466
FacilityGroup          0
Public_Private         0
dtype: int64

We saw earlier that the Community Area was empty. This is a problem beacause some of our analysis are based on it. We need to find them, and this is possible with the Latitude and Longitude

In [69]:
def inspection_per_latitude(dblp):
    return dblp[['Latitude', 'Inspection ID']].groupby('Latitude').count()

ax = inspection_per_latitude(eat_seat).plot(rot=-45)
ax.set_xlabel('Latitude')
ax.set_ylabel('#Inspections')
Out[69]:
Text(0, 0.5, '#Inspections')
In [70]:
def inspection_per_longitude(dblp):
    return dblp[['Longitude', 'Inspection ID']].groupby('Longitude').count()

ax = inspection_per_longitude(eat_seat).plot(rot=-45)
ax.set_xlabel('Longitude')
ax.set_ylabel('#Inspections')
Out[70]:
Text(0, 0.5, '#Inspections')

We see that there are entries with the Latitude or Longitude still missing. We create a dataframe containing them.

In [71]:
missing_location = eat_seat[(eat_seat['Longitude'].isnull()) | (eat_seat['Latitude'].isnull())]

len(missing_location)
Out[71]:
466
In [72]:
only_longitude_or_latitude = missing_location = eat_seat[(eat_seat['Longitude'].isnull()) ^ (eat_seat['Latitude'].isnull())]
len(only_longitude_or_latitude)
Out[72]:
0

We see that there is no entry where only either the Latitude or the Longitude is missing so we can consider one of the two missing as the only condition for the next iterating processes.

We search in the dataframe eat_seat if those missing_location entries are sometimes entered in the data with the Latitude and Longitude columns not empty.

In [73]:
eat_seat[eat_seat['License'].isin(missing_location['License']) & (eat_seat['Longitude'].notnull())]
Out[73]:
Inspection ID DBA Name License Facility Type Risk Address Inspection Date Inspection Type Results Violations Latitude Longitude FacilityGroup Public_Private

We see that there is no entry we can use to infere the Latitude and Longitude missing : the rows with undefined coordinates are dropped.

In [74]:
Long_null = eat_seat[eat_seat['Longitude'].isnull()].index
eat_seat.drop(Long_null, inplace=True)

Lat_null = eat_seat[eat_seat['Latitude'].isnull()].index
eat_seat.drop(Lat_null, inplace=True)
In [75]:
eat_seat['Location'] = eat_seat.apply(lambda x : Point(x['Latitude'], x['Longitude']), axis = 1)
eat_seat.head(3)
Out[75]:
Inspection ID DBA Name License Facility Type Risk Address Inspection Date Inspection Type Results Violations Latitude Longitude FacilityGroup Public_Private Location
0 2352683 CHIPOTLE MEXICAN GRILL 2670642.0 Restaurant 3.0 1025 W ADDISON ST 2019-12-04 License Re-Inspection Pass w/ Conditions 3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E... 41.947118 -87.655381 restaurant Public POINT (41.94711774835235 -87.65538116221241)
1 2352734 CHILI'S T-I 34169.0 Restaurant 3.0 11601 W TOUHY AVE 2019-12-04 Canvass Pass 10. ADEQUATE HANDWASHING SINKS PROPERLY SUPPLI... 42.008536 -87.914428 restaurant Public POINT (42.00853640086873 -87.91442843927048)
2 2352701 TAQUERIA BLUE LINE INC. 2703436.0 Restaurant 3.0 3401 W BELMONT AVE 2019-12-04 License Pass w/ Conditions 3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E... 41.939168 -87.712546 restaurant Public POINT (41.93916783376965 -87.71254599117191)

https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Community-Areas-current-/cauq-8yn6 provides a dataset containing A COMPLETER

It will allows to properly attribute the Community Areas numbers to each establishments, using their Location point and the Polygon of the Community Area.

We check if the Location point is in a Community Area Polygon, and if it's the case, we give to the restaurant the corresponding Community Area number.

In [76]:
commareas = pd.read_csv('CommAreas.csv', error_bad_lines=False)
commareas = commareas.drop(columns = ['PERIMETER', 'AREA', 'COMAREA_', 'COMAREA_ID',\
                                   'COMMUNITY', 'AREA_NUM_1', 'SHAPE_AREA', 'SHAPE_LEN'])
commareas.head(3)
Out[76]:
the_geom AREA_NUMBE
0 MULTIPOLYGON (((-87.60914087617894 41.84469250... 35
1 MULTIPOLYGON (((-87.59215283879394 41.81692934... 36
2 MULTIPOLYGON (((-87.62879823733725 41.80189303... 37
In [77]:
commareas.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 2 columns):
the_geom      77 non-null object
AREA_NUMBE    77 non-null int64
dtypes: int64(1), object(1)
memory usage: 1.3+ KB

The clean_polygon function is created to standardized the the_geom column which needs to be in the same format than the Location column from the eat_seat Dataframe.

In [78]:
def clean_polygon(poly) :
    liste = poly.strip('MULTIPOLYGON (((').strip(')))').split(', ')
    list_pairs = []
    for coordinates in liste :
        pair = coordinates.split(' ')
        pair[0] = re.sub("[^0-9.]", "", pair[0])
        pair[1] = re.sub("[^0-9.]", "", pair[1])
        pair = tuple((float(pair[1]), -float(pair[0])))
        list_pairs.append(pair)
    return Polygon(list_pairs)
In [79]:
commareas.the_geom = commareas.the_geom.apply(clean_polygon)
In [80]:
commareas.head(3)
Out[80]:
the_geom AREA_NUMBE
0 POLYGON ((41.84469250265398 -87.60914087617894... 35
1 POLYGON ((41.81692934626684 -87.59215283879394... 36
2 POLYGON ((41.80189303368919 -87.62879823733725... 37
In [81]:
def comarea_attribution(row) :
    var = 0.0
    for i, area in commareas.iterrows() :
        if (area[0].contains(row.loc['Location'])) :
            var = float(area[1])
            break
    return var

This function is supposed to check if the Location point of each row is in a community area polygon. If it is, it returns the Community Area number, otherwise, it returns 0.0

In [83]:
eat_seat['Community Area'] = eat_seat.apply(lambda x : comarea_attribution(x), axis = 1)
In [84]:
eat_seat.head(3)
Out[84]:
Inspection ID DBA Name License Facility Type Risk Address Inspection Date Inspection Type Results Violations Latitude Longitude FacilityGroup Public_Private Location Community Area
0 2352683 CHIPOTLE MEXICAN GRILL 2670642.0 Restaurant 3.0 1025 W ADDISON ST 2019-12-04 License Re-Inspection Pass w/ Conditions 3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E... 41.947118 -87.655381 restaurant Public POINT (41.94711774835235 -87.65538116221241) 6.0
1 2352734 CHILI'S T-I 34169.0 Restaurant 3.0 11601 W TOUHY AVE 2019-12-04 Canvass Pass 10. ADEQUATE HANDWASHING SINKS PROPERLY SUPPLI... 42.008536 -87.914428 restaurant Public POINT (42.00853640086873 -87.91442843927048) 0.0
2 2352701 TAQUERIA BLUE LINE INC. 2703436.0 Restaurant 3.0 3401 W BELMONT AVE 2019-12-04 License Pass w/ Conditions 3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E... 41.939168 -87.712546 restaurant Public POINT (41.93916783376965 -87.71254599117191) 21.0
In [85]:
eat_seat.to_csv('eat_seat.csv')
In [86]:
def facility_per_comarea(dblp):
    return dblp[['Community Area', 'Inspection ID']].groupby('Community Area').count()

ax = facility_per_comarea(eat_seat).plot(rot=-45)
ax.set_xlabel('Community Area')
ax.set_ylabel('#Inspections')
Out[86]:
Text(0, 0.5, '#Inspections')

2.6 The Violations

The Food Code Rules has changed since the 1st July 2018. After investigating those changes, it seems that only the denomination of the violations but not the violation itself has changed, and a few additionnal violations has been added in the possible violations. It means that those changes does not need more processing and can just be considered together as a common list of violations.

In [87]:
len(eat_seat.Violations.unique())
Out[87]:
116780
In [88]:
eat_seat.Violations[0]
Out[88]:
'3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL EMPLOYEE; KNOWLEDGE, RESPONSIBILITIES AND REPORTING - Comments: FOUND NO EMPLOYEE HEALTH POLICY AT PREMISES. PRIORITY FOUNDATION. NO CITATION ISSUED. 7-38-010. MUST PROVIDE.'

It seems that every violation is a unique entry because it contains not only the violation type but also the comments of the inspectors. We have to split the Violations column into 3 different columns :

  • Violation number
  • Violation type
  • Violation comments

It seems that every violation cell is architectured this way : "number of the violation". "TYPE OF THE VIOLATION" - Comments : "comments of the inspector" (this format repeated as many times as the number of violations detected the day of the inspection, separated with a vertical line)

We just want to keep the violation number (because we can check which violation it is online). We create a column NumberViolations containing the ID of the violations found during the corresponding investigation. The rest is not kept because the titles of the violations can be found online and we do not plan on using the comments of the investigators.

As we can do this cleaning only for the rows where the field Violations is not empty, we will temporarily drop all the other rows.

In [89]:
temp = eat_seat.dropna(subset=['Violations'], axis = 0, how = 'all')
In [90]:
violations = temp.apply(lambda row: re.findall('\|\s([0-9]+)[.]', str(row['Violations'])), axis = 1)
In [91]:
first_violations = temp.apply(lambda row: row['Violations'].split('.')[0], axis = 1)
In [92]:
for violation, first_violation in zip(violations, first_violations):
    violation.append(first_violation)

flat_list = [item for sublist in violations for item in sublist]
unique, counts = np.unique(flat_list, return_counts=True)

This function returns a list of infractions id for each row

In [93]:
temp = temp.assign(NumberViolations = violations)
In [94]:
temp = temp[['Inspection ID', 'NumberViolations']]
temp.head(3)
Out[94]:
Inspection ID NumberViolations
0 2352683 [3]
1 2352734 [47, 49, 55, 55, 10]
2 2352701 [5, 16, 51, 53, 58, 3]
In [95]:
len(eat_seat)
Out[95]:
153650

Now that we have a dataframe with every inspection ID of the inspections where violations has been found and a column containing the list of those violations, we can add it to the primary dataframe.

In [96]:
eat_seat = pd.merge(eat_seat, temp, how='left', on='Inspection ID', left_index=True, right_index=False)
In [97]:
eat_seat = eat_seat.set_index(['Inspection ID']) #redifines the Index
eat_seat.head(3)
Out[97]:
DBA Name License Facility Type Risk Address Inspection Date Inspection Type Results Violations Latitude Longitude FacilityGroup Public_Private Location Community Area NumberViolations
Inspection ID
2352683 CHIPOTLE MEXICAN GRILL 2670642.0 Restaurant 3.0 1025 W ADDISON ST 2019-12-04 License Re-Inspection Pass w/ Conditions 3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E... 41.947118 -87.655381 restaurant Public POINT (41.94711774835235 -87.65538116221241) 6.0 [3]
2352734 CHILI'S T-I 34169.0 Restaurant 3.0 11601 W TOUHY AVE 2019-12-04 Canvass Pass 10. ADEQUATE HANDWASHING SINKS PROPERLY SUPPLI... 42.008536 -87.914428 restaurant Public POINT (42.00853640086873 -87.91442843927048) 0.0 [47, 49, 55, 55, 10]
2352701 TAQUERIA BLUE LINE INC. 2703436.0 Restaurant 3.0 3401 W BELMONT AVE 2019-12-04 License Pass w/ Conditions 3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E... 41.939168 -87.712546 restaurant Public POINT (41.93916783376965 -87.71254599117191) 21.0 [5, 16, 51, 53, 58, 3]
In [98]:
eat_seat.index.is_unique
Out[98]:
True
In [99]:
len(eat_seat)
Out[99]:
153650

2.7 The Inspection Results

In [100]:
eat_seat.isnull().sum()
Out[100]:
DBA Name                0
License                 0
Facility Type           0
Risk                    2
Address                 0
Inspection Date         0
Inspection Type         1
Results                 0
Violations          36281
Latitude                0
Longitude               0
FacilityGroup           0
Public_Private          0
Location                0
Community Area          0
NumberViolations    36281
dtype: int64

We see that there are more than 50'000 rows where the Violations column is empty. We have to study wether those cells are empty because there were no violations (meaning the estalishment is healthy) or because the inspection was not successfull (meaning we can drop the row because it can not be used in our research).

In [101]:
eat_seat.Results.unique()
Out[101]:
array(['Pass w/ Conditions', 'Pass', 'Fail', 'Not Ready',
       'Out of Business', 'No Entry', 'Business Not Located'],
      dtype=object)

We create new databases for every entries of the column Results in order to study them.

In [102]:
noentry = eat_seat[eat_seat['Results']=='No Entry']
In [103]:
outofbusiness = eat_seat[eat_seat['Results']=='Out of Business']
In [104]:
notready = eat_seat[eat_seat['Results']=='Not Ready']
In [105]:
businessnotlocated = eat_seat[eat_seat['Results']=='Business Not Located']
In [106]:
passwithconditions = eat_seat[eat_seat['Results']=='Pass w/ Conditions']
In [107]:
passed = eat_seat[eat_seat['Results']=='Pass']
In [108]:
fail = eat_seat[eat_seat['Results']=='Fail']

We investigate how many of each results' type has the column Violations empty.

In [109]:
results_dic = {'No Entry' : noentry, 'Out of Business' : outofbusiness, 'Not Ready' : notready,\
               'Business Not Located' : businessnotlocated, 'Pass With Conditions' : passwithconditions, 'Pass' : passed, 'Fail' : fail}
In [110]:
for name, result in results_dic.items() :
    print(name, ':', len(result[result['Violations'].isnull()]), 'empty Violation columns /', len(result),\
          'columns =', (len(result[result['Violations'].isnull()])/len(result)), '\n')
No Entry : 5224 empty Violation columns / 5558 columns = 0.9399064411658871 

Out of Business : 10007 empty Violation columns / 10031 columns = 0.9976074170072774 

Not Ready : 1254 empty Violation columns / 1280 columns = 0.9796875 

Business Not Located : 29 empty Violation columns / 29 columns = 1.0 

Pass With Conditions : 391 empty Violation columns / 22901 columns = 0.01707349024060085 

Pass : 17377 empty Violation columns / 85077 columns = 0.20425026740482152 

Fail : 1999 empty Violation columns / 28774 columns = 0.06947244039758114 

We see that almost every entries where the Result is either 'No entry', 'Out of Business', 'Not ready' or 'Business not located' have the Violations field empty. We can safely drop those lines because they are not pertinent for our research.

In [111]:
results = ['Pass', 'Pass w/ Conditions', 'Fail']
In [112]:
eat_seat = eat_seat.loc[eat_seat['Results'].isin(results)]
len(eat_seat)
Out[112]:
136752

Now we have to take care of the cases where there is no Violations and the Result is either Pass, Fail or Pass with conditions.

When the result is Pass and the Violation field is empty, we can add a 0 in the column "NumberViolations".

In [113]:
eat_seat['NumberViolations'].fillna(0, inplace=True)
In [114]:
len(eat_seat)
Out[114]:
136752

When the result is either Fail or Pass with conditions but the Violations field is empty, we will drop those rows because there are missing values. An establishment can indeed not fail an inspection or receive conditions when no violation is found, those entries make no sense and can not be taken into account in our research.

In [115]:
results = ['Pass w/ Conditions', 'Fail']
EmptyViolations = eat_seat[eat_seat['NumberViolations'] == 0]
EmptyinResults = EmptyViolations.loc[eat_seat['Results'].isin(results)]
indexes = EmptyinResults.index
eat_seat = eat_seat.drop(labels = indexes)
In [116]:
len(eat_seat)
Out[116]:
134362

Now we can replace Pass by 1, Pass w/ conditions by 2, and Fail by 3 (that will be useful during the computation of the healthiness score).

In [117]:
eat_seat['Results'].unique()
Out[117]:
array(['Pass w/ Conditions', 'Pass', 'Fail'], dtype=object)
In [118]:
eat_seat['Results'] = eat_seat['Results'].replace({'Fail':3, 'Pass w/ Conditions':2, 'Pass':1})
In [119]:
eat_seat.head(3)
Out[119]:
DBA Name License Facility Type Risk Address Inspection Date Inspection Type Results Violations Latitude Longitude FacilityGroup Public_Private Location Community Area NumberViolations
Inspection ID
2352683 CHIPOTLE MEXICAN GRILL 2670642.0 Restaurant 3.0 1025 W ADDISON ST 2019-12-04 License Re-Inspection 2 3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E... 41.947118 -87.655381 restaurant Public POINT (41.94711774835235 -87.65538116221241) 6.0 [3]
2352734 CHILI'S T-I 34169.0 Restaurant 3.0 11601 W TOUHY AVE 2019-12-04 Canvass 1 10. ADEQUATE HANDWASHING SINKS PROPERLY SUPPLI... 42.008536 -87.914428 restaurant Public POINT (42.00853640086873 -87.91442843927048) 0.0 [47, 49, 55, 55, 10]
2352701 TAQUERIA BLUE LINE INC. 2703436.0 Restaurant 3.0 3401 W BELMONT AVE 2019-12-04 License 2 3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E... 41.939168 -87.712546 restaurant Public POINT (41.93916783376965 -87.71254599117191) 21.0 [5, 16, 51, 53, 58, 3]
In [120]:
eat_seat.Results = pd.to_numeric(eat_seat.Results, downcast = 'integer')
eat_seat.Results = eat_seat.Results.apply(nbestethic) 

Now we can compute the healthiness score of each inspection by multiplying the Results score with the number of Violations :

In [121]:
#add 0 in the new column InspectionScore for every row
liste = []
for i in range (0, len(eat_seat)) :
                liste.append(0)
                
eat_seat['InspectionScore'] = liste
In [122]:
#add the multiplication for the rows where NumberViolations != 0
for ID in eat_seat.index :
    if eat_seat.at[ID, 'NumberViolations'] != 0 :
        eat_seat.at[ID, 'InspectionScore'] = len(eat_seat.at[ID, 'NumberViolations']) * eat_seat.at[ID, 'Results'] * eat_seat.at[ID, 'Results'] * eat_seat.at[ID, 'Results']

Here, a little explanation for the formula used to compute the Inspection Score :

We have 2 parameters, the NumberViolations and Results given by the inspectors (1 : ok, 2 : pass w conditions, 3 : fail).

The NumberViolations is not the main value, beacause sometimes, it can cause some biais : if the restaurant gets only 1 infraction, but a big one, and a fail, the score will be 3 / if a restaurant get several infractions, let's say 5 and the note 2, it will get a 10 score.

To avoid this problem, we take the Results^3. Therefore, the NumberViolations is multiplied by 1, 8 or 27. This is more accurate than just multiplying them.

In [123]:
eat_seat.head(3)
Out[123]:
DBA Name License Facility Type Risk Address Inspection Date Inspection Type Results Violations Latitude Longitude FacilityGroup Public_Private Location Community Area NumberViolations InspectionScore
Inspection ID
2352683 CHIPOTLE MEXICAN GRILL 2670642.0 Restaurant 3.0 1025 W ADDISON ST 2019-12-04 License Re-Inspection 2 3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E... 41.947118 -87.655381 restaurant Public POINT (41.94711774835235 -87.65538116221241) 6.0 [3] 8
2352734 CHILI'S T-I 34169.0 Restaurant 3.0 11601 W TOUHY AVE 2019-12-04 Canvass 1 10. ADEQUATE HANDWASHING SINKS PROPERLY SUPPLI... 42.008536 -87.914428 restaurant Public POINT (42.00853640086873 -87.91442843927048) 0.0 [47, 49, 55, 55, 10] 5
2352701 TAQUERIA BLUE LINE INC. 2703436.0 Restaurant 3.0 3401 W BELMONT AVE 2019-12-04 License 2 3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E... 41.939168 -87.712546 restaurant Public POINT (41.93916783376965 -87.71254599117191) 21.0 [5, 16, 51, 53, 58, 3] 48

2.8 Adding a Inspection Year column

In [125]:
def ret_year(row) :
    return (row.loc['Inspection Date'].year)

Function that returns the year of the inspection based on a datetime object stored in Inspection Date

In [126]:
r = eat_seat.loc[2352683]
In [127]:
a = ret_year(r)

print(a)
2019

Just a quick test, it seems it's working

In [128]:
eat_seat['Inspection Year'] = eat_seat.apply(lambda x : ret_year(x), axis = 1)

Applying this function to the df

In [129]:
eat_seat.head(3)
Out[129]:
DBA Name License Facility Type Risk Address Inspection Date Inspection Type Results Violations Latitude Longitude FacilityGroup Public_Private Location Community Area NumberViolations InspectionScore Inspection Year
Inspection ID
2352683 CHIPOTLE MEXICAN GRILL 2670642.0 Restaurant 3.0 1025 W ADDISON ST 2019-12-04 License Re-Inspection 2 3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E... 41.947118 -87.655381 restaurant Public POINT (41.94711774835235 -87.65538116221241) 6.0 [3] 8 2019
2352734 CHILI'S T-I 34169.0 Restaurant 3.0 11601 W TOUHY AVE 2019-12-04 Canvass 1 10. ADEQUATE HANDWASHING SINKS PROPERLY SUPPLI... 42.008536 -87.914428 restaurant Public POINT (42.00853640086873 -87.91442843927048) 0.0 [47, 49, 55, 55, 10] 5 2019
2352701 TAQUERIA BLUE LINE INC. 2703436.0 Restaurant 3.0 3401 W BELMONT AVE 2019-12-04 License 2 3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E... 41.939168 -87.712546 restaurant Public POINT (41.93916783376965 -87.71254599117191) 21.0 [5, 16, 51, 53, 58, 3] 48 2019
In [130]:
eat_seat.to_csv('eat_seat.csv')
In [131]:
eat_seat = pd.read_csv('eat_seat.csv')
In [132]:
def inspection_per_year(dblp):
    return dblp[['Inspection Year','Inspection ID']].groupby('Inspection Year').count()

ax = inspection_per_year(eat_seat).plot(rot=-45)
ax.set_xlabel('Year')
ax.set_ylabel('#Inspections')
Out[132]:
Text(0, 0.5, '#Inspections')

2.9 Dealing with facilities inspected more than one time in a year

This part will only be used with the owner meanscore and number of restaurants. For the more global analysis, we can keep eat_seat. Here we want to take median if the restaurant has been insepected more than one time a year, and then groupby License to compute a mean of the Inspection Score for each License + DBA Name

In [134]:
duplicates = eat_seat[eat_seat.duplicated(['License','DBA Name', 'Inspection Year'])]
len(duplicates)
Out[134]:
47254

Here we can see that several facilities were inspected more than one time in a year

In [135]:
eat_seat_facility = eat_seat[['License','DBA Name','Community Area','InspectionScore','Inspection Year']].copy()\
                    .groupby(['License','DBA Name','Community Area','Inspection Year']).median()

Selecting the License/DBA Name/Comm Area/Inspection Score/Inspection Year, groupping them on License/DBA Name/Comm Area/Inspection Year and getting the medain for the Inspection Score.

We now have only the median Inspection Score for each facility for each year

In [137]:
eat_seat_facility.head(10)
Out[137]:
InspectionScore
License DBA Name Community Area Inspection Year
2.0 COSI 32.0 2010 5.0
2011 5.0
2012 16.0
2013 3.0
2014 3.5
2015 5.0
2016 20.0
2017 4.0
2018 16.0
9.0 XANDO COFFEE & BAR / COSI SANDWICH BAR 32.0 2010 6.0
In [139]:
eat_seat_facility.reset_index()
Out[139]:
License DBA Name Community Area Inspection Year InspectionScore
0 2.0 COSI 32.0 2010 5.0
1 2.0 COSI 32.0 2011 5.0
2 2.0 COSI 32.0 2012 16.0
3 2.0 COSI 32.0 2013 3.0
4 2.0 COSI 32.0 2014 3.5
... ... ... ... ... ...
87103 3846027.0 ENGLEWOOD STEM H.S. 68.0 2019 24.0
87104 3846104.0 CHICAGO COLLEGIATE CHARTER 50.0 2019 101.0
87105 3846211.0 ST. THERESE WEST 60.0 2019 16.0
87106 3846407.0 CHICAGO WALDORF SCHOOL 77.0 2019 101.0
87107 3846525.0 INTRINSIC SCHOOLS 32.0 2019 55.0

87108 rows × 5 columns

Now we wante to get the mean for each restaurant (License + DBA Name) to merge them wih the restaurant owner

In [142]:
eat_seat_facility.to_csv('eat_seat_facility.csv')
In [143]:
eat_seat_facility = pd.read_csv('eat_seat_facility.csv')

Storing them

In [149]:
duplicates = eat_seat_facility[eat_seat_facility.duplicated(['License','DBA Name','Inspection Year'])]
len(duplicates)
Out[149]:
0
In [144]:
eat_seat_facility_2 = eat_seat_facility[['License','DBA Name','Community Area','InspectionScore']].copy()\
                    .groupby(['License','DBA Name','Community Area']).mean()

Here we only group on the License/DBA Name/Community Area to take the mean of the Inspection Score

In [147]:
eat_seat_facility_2.head(3)
Out[147]:
InspectionScore
License DBA Name Community Area
2.0 COSI 32.0 8.611111
9.0 XANDO COFFEE & BAR / COSI SANDWICH BAR 32.0 30.900000
40.0 COSI 32.0 34.000000
In [148]:
eat_seat_facility_2.to_csv('eat_seat_facility_2.csv')
In [150]:
eat_seat_facility_2 = pd.read_csv('eat_seat_facility_2.csv')

Storing it

In [151]:
eat_seat_facility_2.head(3)
Out[151]:
License DBA Name Community Area InspectionScore
0 2.0 COSI 32.0 8.611111
1 9.0 XANDO COFFEE & BAR / COSI SANDWICH BAR 32.0 30.900000
2 40.0 COSI 32.0 34.000000

Now we see the precedent problem disappeared

3. Additional Dataset - BUSINESS LICENSES/OWNERS

We found two datasets on Kaggle https://www.kaggle.com/chicago/chicago-business-licenses-and-owners, gathering the business licenses and the business owners of Chicago. It could be interesting to observe the results of different establishments owned by the same person.

3.1 Business Licenses

The first dataset contains the details about every licensed establishments. There are a lot of columns but the only ones interesting us are :

  • the license id column to have a link with the chicago food inspections dataset
  • the account number column to have a link with the business owners dataset
  • the police district column in case we want to have a link with the crime dataset
In [152]:
licenses = pd.read_csv('business-licenses.csv', sep=',', low_memory = False) #creation of the dataframe
In [153]:
licenses = pd.read_csv('business-licenses.csv', sep=',') #creation of the dataframe
licenses = licenses.rename(str.lower, axis='columns')
licenses = licenses.drop(columns = ['city', 'state', 'id', 'precinct', 'ward precinct', 'business activity id',\
                                'license number', 'application type', 'application created date',\
                                'application requirements complete', 'payment date', 'conditional approval',\
                                'license term start date', 'license term expiration date', 'license approved for issuance',\
                                'date issued', 'license status', 'license status change date', 'ssa',\
                                'historical wards 2003-2015', 'zip codes', 'wards', 'census tracts', 'location',\
                                'license code', 'license description', 'business activity', 'site number',\
                               'zip code', 'latitude', 'longitude', 'address', 'legal name', 'doing business as name',\
                                'community areas', 'ward'])

licenses['police district'] = pd.to_numeric(licenses['police district'], downcast = 'integer') 
licenses.head(3)
C:\Users\gaguy\Anaconda3\envs\ada\lib\site-packages\IPython\core\interactiveshell.py:3058: DtypeWarning: Columns (9) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
Out[153]:
license id account number police district
0 1480073 1 1.0
1 1278029 1 1.0
2 1337924 1 1.0
In [154]:
licenses = licenses.set_index('account number')
licenses.head(3)
Out[154]:
license id police district
account number
1 1480073 1.0
1 1278029 1.0
1 1337924 1.0
In [155]:
print(len(licenses.index))
991117

3.2 Business Owners

The second dataset contains the details about every license owners. We have decided to keep the following columns :

  • the account number column to have a link with the business licenses dataset
  • the owner first name and the owner last name columns in order to create a full name column
In [156]:
owners = pd.read_csv('business-owners.csv',sep=',') #creation of the dataframe
owners = owners.rename(str.lower, axis='columns')
owners = owners.drop(columns = ['suffix', 'legal entity owner', 'owner middle initial', 'legal name', 'title'])
owners.head(3)
Out[156]:
account number owner first name owner last name
0 373231 GUY SELLARS
1 203002 NELCY SANTANA
2 338012 GREGORY EDINGBURG
In [157]:
owners['full name'] = owners['owner first name'] + ' ' + owners['owner last name']
owners = owners.drop(columns = ['owner first name', 'owner last name'])

A full name column is enough for the needs that we have to link the licenses to the owners.

In [158]:
owners.head(3)
Out[158]:
account number full name
0 373231 GUY SELLARS
1 203002 NELCY SANTANA
2 338012 GREGORY EDINGBURG
In [159]:
len(owners['account number'].unique())
Out[159]:
168162
In [160]:
len(owners['full name'].unique())
Out[160]:
192520

Here we can see that the number of accounts is not the same that the number of full names. For now, we will consider that a same account can be shared by several people (for example, it could be the case for partners owning a business together).

In [161]:
owners = pd.DataFrame(owners.groupby('account number')['full name'].apply(list))
In [162]:
owners.head(3)
Out[162]:
full name
account number
1 [PETER BERGHOFF, PETER BERGHOFF, HERMAN BERGHOFF]
2 [nan, HERMAN BERGHOFF, PETER BERGHOFF, nan, EI...
4 [LAWRENCE PRICE]

We can observe that the lists of the full name column contain duplicates and 'nan' values. A function clean_list can be defined to clean them.

In [163]:
def clean_list(liste) :
    cleaned = []
    for element in liste :
        if type(element) == str and element not in cleaned :
            cleaned.append(element)
    return cleaned
In [164]:
owners['full name'] = owners['full name'].apply(clean_list)
In [165]:
owners.head(3)
Out[165]:
full name
account number
1 [PETER BERGHOFF, HERMAN BERGHOFF]
2 [HERMAN BERGHOFF, PETER BERGHOFF, EILEEN GORMA...
4 [LAWRENCE PRICE]

3.3 Business Licenses-Owners

Setting both the indexes of the licenses and the owners dataframes we can now merge them together.

In [166]:
business = pd.merge(licenses, owners, right_index = True, left_index = True)
In [167]:
business.head(3)
Out[167]:
license id police district full name
account number
1 1480073 1.0 [PETER BERGHOFF, HERMAN BERGHOFF]
1 1278029 1.0 [PETER BERGHOFF, HERMAN BERGHOFF]
1 1337924 1.0 [PETER BERGHOFF, HERMAN BERGHOFF]
In [168]:
business.reset_index(level=0, inplace=True)
In [169]:
business.head(3)
Out[169]:
account number license id police district full name
0 1 1480073 1.0 [PETER BERGHOFF, HERMAN BERGHOFF]
1 1 1278029 1.0 [PETER BERGHOFF, HERMAN BERGHOFF]
2 1 1337924 1.0 [PETER BERGHOFF, HERMAN BERGHOFF]

3.4 Second Main Dataframe

Setting both the indexes of the business and the eat_seat dataframes we can now merge them together.

In [173]:
business = business.rename(columns= {'license id' : 'License'})
eat_seat_2 = pd.merge(eat_seat_facility_2, business, on = ['License'])
In [174]:
eat_seat_2.head(3)
Out[174]:
License DBA Name Community Area InspectionScore account number police district full name
0 1770.0 VALENTINO CLUB CAFE 18.0 1.00 81 10.0 [RUDOLFO GUERRERO, JOSE GUERRERO]
1 7141.0 SOUTHWEST MONTESSORI PRE SCHOL 71.0 11.00 282 20.0 [HECTOR RODRIGUEZ]
2 15622.0 ORIGINAL PANCAKE HOUSE 72.0 37.75 86515 12.0 [GERALD BURGER, GREGORY MULDOON]
In [175]:
eat_seat_2.to_csv('newfood.csv')

3.5 Dataframe with owner, mean of his restaurants inspection score and number of restaurants owned

In [176]:
license_owner = eat_seat_2.explode('full name').dropna(how='any', subset=['full name'])

Full Name column is filled with name of the multiple owners of each restaurant. These names are stored in a list and we one owner on each row to be able to group the data on them. This is what this cell does ! We also delete the row without the name of the owner

In [177]:
license_owner.head(3)
Out[177]:
License DBA Name Community Area InspectionScore account number police district full name
0 1770.0 VALENTINO CLUB CAFE 18.0 1.0 81 10.0 RUDOLFO GUERRERO
0 1770.0 VALENTINO CLUB CAFE 18.0 1.0 81 10.0 JOSE GUERRERO
1 7141.0 SOUTHWEST MONTESSORI PRE SCHOL 71.0 11.0 282 20.0 HECTOR RODRIGUEZ

Looking if the explode was done correctly

In [178]:
license_owner.isnull().sum()
Out[178]:
License               0
DBA Name              0
Community Area        0
InspectionScore       0
account number        0
police district    2137
full name             0
dtype: int64

We can see here there is no more missin value for the Full Name column

In [179]:
license_owner_score_number = license_owner[['full name','InspectionScore']].copy().groupby('full name').mean()

Grouping the inspection score on the Full Name (owner), and then taking the median of the inspection score of all restaurants that belong to the owner

In [180]:
license_owner_score_number.head(3)
Out[180]:
InspectionScore
full name
A MOHIT 0.000000
AAMIR DAWOOD 40.666667
AAMIR HUSSAIN 40.000000
In [181]:
license_owner_score_number['N Restaurants'] = license_owner[['full name','DBA Name']].copy().groupby('full name').count()['DBA Name']

Grouping the DBA Name on the Full Name (owner), and then counting them to get the number of all the restaurants that belong to the owner

In [182]:
license_owner_score_number.head(10)
Out[182]:
InspectionScore N Restaurants
full name
A MOHIT 0.000000 1
AAMIR DAWOOD 40.666667 3
AAMIR HUSSAIN 40.000000 1
AAMIR KHAN 2.000000 1
AARON BUSH 109.000000 1
AARON CHIAPAKDEE 48.166667 4
AARON CRANE 30.333333 1
AARON CRUMBAUGH 0.000000 1
AARON DIMARIA 122.000000 1
AARON HOOD 19.729167 4
In [183]:
license_owner_score_number.to_csv('license_owner_score_number.csv')

Storing the dataframe

3.6 Dataframe with facility Group and median score for each facility type

In [184]:
Eat_Year_Type = pd.DataFrame()
Eat_Year_Type['FacilityGroup'] = ['childrens_services','adultcare','bakery_restaurant','banquet','bar_restaurant','catering','daycare'\
                                  ,'golden_diner','grocery_restaurant'\
                                  ,'liquor_restaurant','restaurant','rooftop_restaurant','school']
Eat_Year_Type
Out[184]:
FacilityGroup
0 childrens_services
1 adultcare
2 bakery_restaurant
3 banquet
4 bar_restaurant
5 catering
6 daycare
7 golden_diner
8 grocery_restaurant
9 liquor_restaurant
10 restaurant
11 rooftop_restaurant
12 school
In [185]:
for x in range(2010,2020) :
    df = eat_seat.loc[eat_seat['Inspection Year'] == x][['FacilityGroup','InspectionScore']].copy().groupby('FacilityGroup').mean()
    df_2 = pd.merge(Eat_Year_Type, df, how = 'outer', on = ['FacilityGroup'])
    Eat_Year_Type[x] = df_2['InspectionScore']

This cell select the good row for each year, and then group the Inspection Score (mean) on the FacilityGroup

In [186]:
Eat_Year_Type
Out[186]:
FacilityGroup 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
0 childrens_services NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 adultcare 33.910798 39.953846 35.508380 38.783505 40.306667 33.133047 38.318182 41.488000 73.512605 62.645933
2 bakery_restaurant 5.000000 24.000000 64.000000 81.500000 8.000000 NaN NaN NaN NaN NaN
3 banquet 4.000000 81.937500 63.000000 50.793103 42.250000 18.608696 14.931034 27.533333 163.368421 90.312500
4 bar_restaurant 76.250000 4.500000 45.555556 64.000000 69.500000 69.166667 48.625000 27.500000 91.333333 24.500000
5 catering 42.032258 32.092784 38.716216 24.678161 24.304762 37.810345 32.943548 32.922330 44.378049 65.550562
6 daycare 34.079172 31.232648 35.620979 29.519380 34.801170 19.871795 23.569061 25.119181 43.782609 46.995305
7 golden_diner 30.386364 43.945946 9.621622 25.774194 15.529412 24.622951 24.681159 28.096774 45.967213 37.142857
8 grocery_restaurant 50.058824 33.687500 59.625000 45.916667 19.071429 15.875000 57.312500 81.142857 109.000000 110.444444
9 liquor_restaurant NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN NaN
10 restaurant 40.378860 41.392556 40.047406 35.031800 36.985640 41.157266 40.921986 41.441261 60.265968 90.780985
11 rooftop_restaurant 0.000000 55.250000 19.000000 2.000000 14.000000 1.333333 1.500000 NaN NaN NaN
12 school 23.309795 28.157504 32.976542 24.825022 28.070473 30.290861 31.385321 30.081456 41.584048 40.268215
In [187]:
Eat_Year_Type.to_csv('Eat_Type_Year.csv')
In [188]:
Eat_Year_PriPub = pd.DataFrame()
Eat_Year_PriPub['Public_Private'] = ['Public','Private']
Eat_Year_PriPub
Out[188]:
Public_Private
0 Public
1 Private
In [189]:
for x in range(2010,2020) :
    df = eat_seat.loc[eat_seat['Inspection Year'] == x][['Public_Private','InspectionScore']].copy().groupby('Public_Private').mean()
    df_2 = pd.merge(Eat_Year_PriPub, df, how = 'outer', on = ['Public_Private'])
    Eat_Year_PriPub[x] = df_2['InspectionScore']

This cell select the good row for each year, and then group the Inspection Score (mean) on the Public_Private

In [190]:
Eat_Year_PriPub
Out[190]:
Public_Private 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
0 Public 40.362248 41.35374 39.980160 34.965588 36.791231 40.989888 40.737665 41.318009 60.311690 90.28246
1 Private 28.031051 29.89929 34.122738 26.773817 30.163849 27.668753 30.181954 30.177743 46.173821 45.03603
In [191]:
Eat_Year_Type.to_csv('Eat_PriPub_Year.csv')

Storing the data

4. Additional Dataset - CRIME

In [192]:
crime_2008_2011 = pd.read_csv('Chicago_Crimes_2008_to_2011.csv', error_bad_lines=False)
b'Skipping line 1149094: expected 23 fields, saw 41\n'
In [193]:
crime_2012_2017 = pd.read_csv('Chicago_Crimes_2012_to_2017.csv', error_bad_lines=False)
In [194]:
crime = crime_2008_2011.append(crime_2012_2017)

crime.head(3)
Out[194]:
Unnamed: 0 ID Case Number Date Block IUCR Primary Type Description Location Description Arrest ... Ward Community Area FBI Code X Coordinate Y Coordinate Year Updated On Latitude Longitude Location
0 388 4785 HP610824 10/07/2008 12:39:00 PM 000XX E 75TH ST 0110 HOMICIDE FIRST DEGREE MURDER ALLEY True ... 6.0 69.0 01A 1178207.0 1855308.0 2008 08/17/2015 03:03:40 PM 41.758276 -87.622451 (41.758275857, -87.622451031)
1 835 4786 HP616595 10/09/2008 03:30:00 AM 048XX W POLK ST 0110 HOMICIDE FIRST DEGREE MURDER STREET True ... 24.0 25.0 01A 1144200.0 1895857.0 2008 08/17/2015 03:03:40 PM 41.870252 -87.746069 (41.87025207, -87.746069362)
2 1334 4787 HP616904 10/09/2008 08:35:00 AM 030XX W MANN DR 0110 HOMICIDE FIRST DEGREE MURDER PARK PROPERTY False ... 18.0 66.0 01A 1157314.0 1859778.0 2008 08/17/2015 03:03:40 PM 41.770990 -87.698901 (41.770990476, -87.698901469)

3 rows × 23 columns

Load the data found on https://www.kaggle.com/currie32/crimes-in-chicago in pandas DataFrames and concatenate them.

In [195]:
crime.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4145424 entries, 0 to 1456713
Data columns (total 23 columns):
Unnamed: 0              int64
ID                      int64
Case Number             object
Date                    object
Block                   object
IUCR                    object
Primary Type            object
Description             object
Location Description    object
Arrest                  bool
Domestic                bool
Beat                    int64
District                float64
Ward                    float64
Community Area          float64
FBI Code                object
X Coordinate            float64
Y Coordinate            float64
Year                    int64
Updated On              object
Latitude                float64
Longitude               float64
Location                object
dtypes: bool(2), float64(7), int64(4), object(10)
memory usage: 703.7+ MB

Taking a quick look at the composition of the Dataframe

In [196]:
crime = crime.drop(['District', 'Ward', 'Unnamed: 0','Case Number','IUCR','Arrest', 'Domestic','FBI Code','X Coordinate','Y Coordinate','Updated On', 'Location Description'], axis=1)
crime.head(3)
Out[196]:
ID Date Block Primary Type Description Beat Community Area Year Latitude Longitude Location
0 4785 10/07/2008 12:39:00 PM 000XX E 75TH ST HOMICIDE FIRST DEGREE MURDER 323 69.0 2008 41.758276 -87.622451 (41.758275857, -87.622451031)
1 4786 10/09/2008 03:30:00 AM 048XX W POLK ST HOMICIDE FIRST DEGREE MURDER 1533 25.0 2008 41.870252 -87.746069 (41.87025207, -87.746069362)
2 4787 10/09/2008 08:35:00 AM 030XX W MANN DR HOMICIDE FIRST DEGREE MURDER 831 66.0 2008 41.770990 -87.698901 (41.770990476, -87.698901469)

The datasets have several columns not interesting for us, there are dropped. In terms of location we decide to only keep the Community Area column in order to have a link with the chicago food inspections dataset. We also keep the longitude and latitude to maybe map the crimes for the visualization.

In [197]:
duplicates = crime[crime.duplicated(keep='first')]
len(duplicates)
Out[197]:
1147495

Taking a look at the duplicates of the crime DataFrame. We see that multiple crimes ares stored more than one time in the dataset

In [198]:
crime = crime.drop_duplicates(keep='first')
In [199]:
len(crime)
Out[199]:
2997929

Deleting the duplicates, and taking a look at the Datafarme length after the deletion of duplicates

In [200]:
for x in crime.columns :
    print(x + ' : ' + str(crime[x].isnull().values.any()) + ' --> ' + str(crime[x].isnull().sum()))
ID : False --> 0
Date : False --> 0
Block : False --> 0
Primary Type : False --> 0
Description : False --> 0
Beat : False --> 0
Community Area : True --> 869
Year : False --> 0
Latitude : True --> 52736
Longitude : True --> 52736
Location : True --> 52736

Here we are looking at the missing values. Most of the missing data are part of the Latitude, Longitude and Location columns.

For our analysis, we need the community area value, here 869 community area are missing.

869 is only 0.03% of 2997929, we decide to drop these rows.

In [201]:
crime = crime.dropna(how='any', subset=['Community Area'])

Dropping the row with Community Area missing values

In [202]:
for x in crime.columns :
    print(x + ' : ' + str(crime[x].isnull().values.any()) + ' --> ' + str(crime[x].isnull().sum()))
ID : False --> 0
Date : False --> 0
Block : False --> 0
Primary Type : False --> 0
Description : False --> 0
Beat : False --> 0
Community Area : False --> 0
Year : False --> 0
Latitude : True --> 52646
Longitude : True --> 52646
Location : True --> 52646
In [203]:
def block_approx(address) :
    return address.replace('XX','00')

The Block column contains anonymized addresses ("XX"). By replacing the XX by 00 we already can get an approximative position of where the crimes took place.

In [204]:
crime.Block = crime.Block.apply(block_approx)
In [205]:
print(set(crime['Primary Type']))
{'OTHER NARCOTIC VIOLATION', 'GAMBLING', 'INTERFERENCE WITH PUBLIC OFFICER', 'LIQUOR LAW VIOLATION', 'THEFT', 'NON-CRIMINAL (SUBJECT SPECIFIED)', 'WEAPONS VIOLATION', 'STALKING', 'ARSON', 'ROBBERY', 'NON - CRIMINAL', 'PUBLIC PEACE VIOLATION', 'OTHER OFFENSE', 'SEX OFFENSE', 'PROSTITUTION', 'MOTOR VEHICLE THEFT', 'PUBLIC INDECENCY', 'NARCOTICS', 'CONCEALED CARRY LICENSE VIOLATION', 'OBSCENITY', 'HOMICIDE', 'BURGLARY', 'BATTERY', 'NON-CRIMINAL', 'HUMAN TRAFFICKING', 'KIDNAPPING', 'CRIM SEXUAL ASSAULT', 'CRIMINAL DAMAGE', 'INTIMIDATION', 'OFFENSE INVOLVING CHILDREN', 'DECEPTIVE PRACTICE', 'ASSAULT', 'CRIMINAL TRESPASS'}

We print all the Primary Type of crimes in order to create a dictionary with every crime type keyed with their minimum sentence in terms of a year of imprisonment. -- More details below.

In [206]:
crime_penalty_dic = {'PROSTITUTION' : 0.1, 'DOMESTIC VIOLENCE' : 0.1, 'MOTOR VEHICLE THEFT' : 3.0, 'ASSAULT' : 0.1, 'OFFENSE INVOLVING CHILDREN' : 0.1,\
                 'RITUALISM' : 0.1, 'BATTERY' : 0.1, 'NON-CRIMINAL (SUBJECT SPECIFIED)' : 0.1, 'CRIM SEXUAL ASSAULT' : 4.0, 'GAMBLING' : 0.1,\
                 'PUBLIC INDECENCY' : 0.1, 'OTHER OFFENSE' : 0.1, 'LIQUOR LAW VIOLATION' : 0.1, 'OTHER NARCOTIC VIOLATION' : 0.1, 'OBSCENITY' : 0.1,\
                 'NON-CRIMINAL' : 0.1, 'KIDNAPPING' : 3.0, 'HOMICIDE' : 20.0, 'NARCOTICS' : 0.1, 'ARSON' : 6.0, 'DECEPTIVE PRACTICE' : 0.1, 'ROBBERY' : 3.0,\
                 'BURGLARY' : 3.0, 'NON - CRIMINAL' : 0.1, 'INTIMIDATION' : 2.0, 'HUMAN TRAFFICKING' : 4.0, 'SEX OFFENSE' : 4.0, 'CRIMINAL TRESPASS' : 0.1,\
                 'CONCEALED CARRY LICENSE VIOLATION' : 2.0, 'CRIMINAL DAMAGE' : 1.0, 'INTERFERENCE WITH PUBLIC OFFICER' : 0.1, 'PUBLIC PEACE VIOLATION' : 0.1,\
                 'WEAPONS VIOLATION' : 0.1, 'THEFT' : 0.1, 'STALKING' : 0.1}

To make the crime_penalty dictionary, we checked the minimum prison penalty (in years) for each crime types. For the crime where the penalty is less than 1 year, the value is fixed at 0.1 when no prison is needed, and the score in month (6 months = 0.5) if a minmum prison penalty is given in months.

These values are taken from the Illinois Penalty Code.

In [207]:
crime_penalty = pd.DataFrame(data = crime_penalty_dic.values(), index=crime_penalty_dic.keys(), columns = ['Crime Score'])
crime_penalty.head(3)
Out[207]:
Crime Score
PROSTITUTION 0.1
DOMESTIC VIOLENCE 0.1
MOTOR VEHICLE THEFT 3.0
In [208]:
crime_penalty.index.name = 'Primary Type'
In [209]:
crime = pd.merge(crime, crime_penalty, on = ['Primary Type'])
crime.head(3)
Out[209]:
ID Date Block Primary Type Description Beat Community Area Year Latitude Longitude Location Crime Score
0 4785 10/07/2008 12:39:00 PM 00000 E 75TH ST HOMICIDE FIRST DEGREE MURDER 323 69.0 2008 41.758276 -87.622451 (41.758275857, -87.622451031) 20.0
1 4786 10/09/2008 03:30:00 AM 04800 W POLK ST HOMICIDE FIRST DEGREE MURDER 1533 25.0 2008 41.870252 -87.746069 (41.87025207, -87.746069362) 20.0
2 4787 10/09/2008 08:35:00 AM 03000 W MANN DR HOMICIDE FIRST DEGREE MURDER 831 66.0 2008 41.770990 -87.698901 (41.770990476, -87.698901469) 20.0

Merging the crime and crime_penalty Dataframes to compute Crime Score for each row

In [210]:
crime.to_csv('crime.csv')

5. Generating the 3 dataframe for our analysis

Here we generate 2 dataframe : Inspection Score per Year per Comm Area and Crime Score per Year per Comm Area

In [211]:
Crime_Year_CommArea = pd.DataFrame()
Crime_Year_CommArea['Community Area'] = np.arange(0.0,78.0)

Creating a new Dataframe with a column for the 77 Comm Area of Chicago

In [212]:
for x in range(2010,2018) :
    df = crime.loc[crime['Year'] == x][['Community Area','Crime Score']].copy().groupby('Community Area').sum()
    Crime_Year_CommArea[x] = df['Crime Score']

Selecting the comm area and crime score for each year, then group by comm area to compute the sum of crimescore for each comm area for each year

In [213]:
Crime_Year_CommArea
Out[213]:
Community Area 2010 2011 2012 2013 2014 2015 2016 2017
0 0.0 0.3 0.5 4.4 4.1 1.1 1.1 NaN NaN
1 1.0 3572.3 3329.1 2874.9 2447.0 2262.3 2196.0 2380.6 86.0
2 2.0 3975.1 3584.9 3170.7 2526.7 2440.4 2238.1 2497.2 158.2
3 3.0 2884.2 2849.6 3032.2 2244.6 1917.7 2222.5 2110.1 120.2
4 4.0 1851.2 1679.5 1767.3 1529.3 1173.1 1037.9 1274.5 54.1
... ... ... ... ... ... ... ... ... ...
73 73.0 3647.0 3544.9 3068.1 2747.9 2115.3 2226.4 2501.9 89.1
74 74.0 478.9 373.1 345.4 360.7 359.8 266.5 305.3 15.7
75 75.0 2234.1 2259.6 2002.7 1523.8 1320.6 1244.9 1220.0 61.7
76 76.0 604.3 554.3 651.3 609.9 708.9 567.5 707.9 36.9
77 77.0 2482.3 2330.5 2316.4 1682.0 1313.8 1322.5 1548.7 75.5

78 rows × 9 columns

In [214]:
Eat_Year_CommArea = pd.DataFrame()
Eat_Year_CommArea['Community Area'] = np.arange(0.0,78.0)

Creating a new Dataframe with a column for the 77 Comm Area of Chicago

In [215]:
for x in range(2010,2020) :
    df = eat_seat.loc[eat_seat['Inspection Year'] == x][['Community Area','InspectionScore']].copy().groupby('Community Area').mean()
    Eat_Year_CommArea[x] = df['InspectionScore']

Selecting the comm area and inspection score for each year, then group by comm area to compute the median of inspectionscore for each comm area for each year

In [216]:
Eat_Year_CommArea
Out[216]:
Community Area 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
0 0.0 23.468966 20.407583 14.798319 21.663636 20.144981 33.620438 30.505495 19.880435 44.850000 42.034615
1 1.0 39.251572 53.355000 62.423645 43.438053 22.062201 56.381395 67.159236 25.429104 55.022831 118.766129
2 2.0 58.363636 62.550336 61.764493 57.908163 26.806202 20.509363 68.905556 58.531579 116.579114 80.261194
3 3.0 56.081448 65.475904 29.535971 35.849057 56.966759 34.114035 41.937870 67.795455 73.959732 89.592920
4 4.0 73.778351 42.085526 37.649746 35.472222 35.032609 32.761538 39.003534 54.254545 54.326733 116.350785
... ... ... ... ... ... ... ... ... ... ... ...
73 73.0 31.488095 43.870968 40.649485 42.123288 37.095890 42.620253 51.509804 40.428571 37.540984 74.315068
74 74.0 23.729167 30.333333 8.775510 11.342105 32.116667 34.405405 34.596154 31.695652 26.657143 98.777778
75 75.0 40.672727 48.355140 21.933333 32.581395 31.824742 54.011494 66.900000 41.951923 64.774648 114.000000
76 76.0 20.434783 38.666667 39.896552 22.818182 16.363636 43.588235 61.558824 63.657143 74.368421 66.733333
77 77.0 44.265116 51.544118 43.405316 43.306513 39.372093 30.093960 29.500000 72.995098 72.823308 53.902834

78 rows × 11 columns

In [217]:
Eat_Year_CommArea.to_csv('Eat_Year_CommArea.csv')
In [218]:
Crime_Year_CommArea.to_csv('Crime_Year_CommArea.csv')

These 2 DataFrame will be useful to compute the correlation, and for the different visualisations

In [2]:
eat_seat_facility_2 = pd.read_csv('eat_seat_facility_2.csv')

Here we load the eat_seat_facility_2 DataFrame

In [3]:
eat_seat_facility_2.head(3)
Out[3]:
License DBA Name Community Area InspectionScore
0 2.0 COSI 32.0 8.611111
1 9.0 XANDO COFFEE & BAR / COSI SANDWICH BAR 32.0 30.900000
2 40.0 COSI 32.0 34.000000
In [10]:
eat_seat_facility_3 = eat_seat_facility_2[['License','Community Area']].copy().groupby('Community Area').count()

Here we group by community Area on License to compute the number of restaurant for each Community Area

In [11]:
eat_seat_facility_3.rename(columns={'License' : 'N of restaurants'})
Out[11]:
N of restaurants
Community Area
0.0 266
1.0 382
2.0 461
3.0 459
4.0 383
... ...
73.0 149
74.0 83
75.0 130
76.0 71
77.0 451

78 rows × 1 columns

In [12]:
eat_seat_facility_3.to_csv('eat_seat_facility_3.csv')

Storing the DataFrame

6. Analysis

Evolution

Map visualization with Folium - Graph visualization with plot.ly

Now that we have computed the crime score and the hygiene score for every community area in Chicago, we want to do some visualization with maps. For this, we will use Folium. It allows us to create a map of Chicago and draw the boudaries of all the community areas. For this we use the GeoJSON file found on the Chicago Data Portal https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Community-Areas-current-/cauq-8yn6 (corresponding to the csv file used earlier to define the boudaries of the community areas). Then we use color scales to determine where there is more crime and non hygienic facilities.

Hygiene scores per community area

In [2]:
#opening the dataframe saved in the precedent part
eat_year_commareas = pd.read_csv('Eat_Year_CommArea.csv',sep=',')
eat_year_commareas.head(3)
Out[2]:
Unnamed: 0 Community Area 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
0 0 0.0 23.468966 20.407583 14.798319 21.663636 20.144981 33.620438 30.505495 19.880435 44.850000 42.034615
1 1 1.0 39.251572 53.355000 62.423645 43.438053 22.062201 56.381395 67.159236 25.429104 55.022831 118.766129
2 2 2.0 58.363636 62.550336 61.764493 57.908163 26.806202 20.509363 68.905556 58.531579 116.579114 80.261194
In [3]:
#getting rid of decimal in Community Area and turning them into string type
eat_year_commareas['Community Area'] = eat_year_commareas['Community Area'].astype('int')
eat_year_commareas['Community Area'] = eat_year_commareas['Community Area'].astype('str')
eat_year_commareas.head(3)
Out[3]:
Unnamed: 0 Community Area 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
0 0 0 23.468966 20.407583 14.798319 21.663636 20.144981 33.620438 30.505495 19.880435 44.850000 42.034615
1 1 1 39.251572 53.355000 62.423645 43.438053 22.062201 56.381395 67.159236 25.429104 55.022831 118.766129
2 2 2 58.363636 62.550336 61.764493 57.908163 26.806202 20.509363 68.905556 58.531579 116.579114 80.261194
In [4]:
#geojson file containing the boundaries for each community area in Chicago
comareas_geo = r'Boundaries.geojson'
In [5]:
#creation of the map centered on Chicago city
eat_map_2010 = folium.Map(location=[41.85, -87.55],
                 tiles='Stamen Terrain',
                 zoom_start=9.5,
                 )

#adding to the map the boundaries and the hygiene scores of 2010
folium.Choropleth(name='Hygiene scores in 2010',
                  geo_data = comareas_geo, 
                  data = eat_year_commareas,
                  columns = ['Community Area', '2010'],
                  key_on = 'feature.properties.area_num_1',
                  fill_color = 'YlOrRd',
                  fill_opacity = 0.7,
                  line_opacity = 0.2,
                  threshold_scale = [0, 20, 40, 60, 80, 100, 120],
                  legend_name = 'Hygiene score per community area in 2010'
                 ).add_to(eat_map_2010)

eat_map_2010.save('eat_map_2010.html')
eat_map_2010
Out[5]:
In [6]:
#same as above for 2011

eat_map_2011 = folium.Map(location=[41.85, -87.55],
                 tiles='Stamen Terrain',
                 zoom_start=9.5,
                 )

folium.Choropleth(name='Hygiene scores in 2011',
                  geo_data = comareas_geo, 
                  data = eat_year_commareas,
                  columns = ['Community Area', '2011'],
                  key_on = 'feature.properties.area_num_1',
                  fill_color = 'YlOrRd',
                  fill_opacity = 0.7,
                  line_opacity = 0.2,
                  threshold_scale = [0, 20, 40, 60, 80, 100, 120],
                  legend_name = 'Hygiene score per community area in 2011'
                 ).add_to(eat_map_2011)

eat_map_2011.save('eat_map_2011.html')
eat_map_2011
Out[6]:
In [7]:
#same as above for 2012

eat_map_2012 = folium.Map(location=[41.85, -87.55],
                 tiles='Stamen Terrain',
                 zoom_start=9.5,
                 )

folium.Choropleth(name='Hygiene scores in 2012',
                  geo_data = comareas_geo, 
                  data = eat_year_commareas,
                  columns = ['Community Area', '2012'],
                  key_on = 'feature.properties.area_num_1',
                  fill_color = 'YlOrRd',
                  fill_opacity = 0.7,
                  line_opacity = 0.2,
                  threshold_scale = [0, 20, 40, 60, 80, 100, 120],
                  legend_name = 'Hygiene score per community area in 2012'
                 ).add_to(eat_map_2012)

eat_map_2012.save('eat_map_2012.html')
eat_map_2012
Out[7]:
In [8]:
#same as above for 2013

eat_map_2013 = folium.Map(location=[41.85, -87.55],
                 tiles='Stamen Terrain',
                 zoom_start=9.5,
                 )

folium.Choropleth(name='Hygiene scores in 2013',
                  geo_data = comareas_geo, 
                  data = eat_year_commareas,
                  columns = ['Community Area', '2013'],
                  key_on = 'feature.properties.area_num_1',
                  fill_color = 'YlOrRd',
                  fill_opacity = 0.7,
                  line_opacity = 0.2,
                  threshold_scale = [0, 20, 40, 60, 80, 100, 120],
                  legend_name = 'Hygiene score per community area in 2013'
                 ).add_to(eat_map_2013)

eat_map_2013.save('eat_map_2013.html')
eat_map_2013
Out[8]:
In [9]:
#same as above for 2014

eat_map_2014 = folium.Map(location=[41.85, -87.55],
                 tiles='Stamen Terrain',
                 zoom_start=9.5,
                 )

folium.Choropleth(name='Hygiene scores in 2014',
                  geo_data = comareas_geo, 
                  data = eat_year_commareas,
                  columns = ['Community Area', '2014'],
                  key_on = 'feature.properties.area_num_1',
                  fill_color = 'YlOrRd',
                  fill_opacity = 0.7,
                  line_opacity = 0.2,
                  threshold_scale = [0, 20, 40, 60, 80, 100, 120],
                  legend_name = 'Hygiene score per community area in 2014'
                 ).add_to(eat_map_2014)

eat_map_2014.save('eat_map_2014.html')
eat_map_2014
Out[9]:
In [10]:
#same as above for 2015

eat_map_2015 = folium.Map(location=[41.85, -87.55],
                 tiles='Stamen Terrain',
                 zoom_start=9.5,
                 )

folium.Choropleth(name='Hygiene scores in 2015',
                  geo_data = comareas_geo, 
                  data = eat_year_commareas,
                  columns = ['Community Area', '2015'],
                  key_on = 'feature.properties.area_num_1',
                  fill_color = 'YlOrRd',
                  fill_opacity = 0.7,
                  line_opacity = 0.2,
                  threshold_scale = [0, 20, 40, 60, 80, 100, 120],
                  legend_name = 'Hygiene score per community area in 2015'
                 ).add_to(eat_map_2015)

eat_map_2015.save('eat_map_2015.html')
eat_map_2015
Out[10]:
In [11]:
#same as above for 2016

eat_map_2016 = folium.Map(location=[41.85, -87.55],
                 tiles='Stamen Terrain',
                 zoom_start=9.5,
                 )

folium.Choropleth(name='Hygiene scores in 2016',
                  geo_data = comareas_geo, 
                  data = eat_year_commareas,
                  columns = ['Community Area', '2016'],
                  key_on = 'feature.properties.area_num_1',
                  fill_color = 'YlOrRd',
                  fill_opacity = 0.7,
                  line_opacity = 0.2,
                  threshold_scale = [0, 20, 40, 60, 80, 100, 120],
                  legend_name = 'Hygiene score per community area in 2016'
                 ).add_to(eat_map_2016)

eat_map_2016.save('eat_map_2016.html')
eat_map_2016
Out[11]:
In [12]:
#same as above for 2017

eat_map_2017 = folium.Map(location=[41.85, -87.55],
                 tiles='Stamen Terrain',
                 zoom_start=9.5,
                 )

folium.Choropleth(name='Hygiene scores in 2017',
                  geo_data = comareas_geo, 
                  data = eat_year_commareas,
                  columns = ['Community Area', '2017'],
                  key_on = 'feature.properties.area_num_1',
                  fill_color = 'YlOrRd',
                  fill_opacity = 0.7,
                  line_opacity = 0.2,
                  threshold_scale = [0, 20, 40, 60, 80, 100, 120],
                  legend_name = 'Hygiene score per community area in 2017'
                 ).add_to(eat_map_2017)

eat_map_2017.save('eat_map_2017.html')
eat_map_2017
Out[12]:

Comments :

As we can see from the maps, there is no particular trend, neither a generalized increase or decrease.

We can see that only a few community areas went over a hygiene score of a hundred : Burnside (47) in 2013 and Washington Park (40) and Woodlawn (42) in 2015.

We could assume that they are the worst community areas to go eating alongside with some others that stood out with a hygiene score superior to eighty as Brighton Park (58) in 2010, Austin (25) in 2012, South Deering (51) in 2014, East Garfield Park (27) and Hyde Park (41) in 2015, West Garfield Park (26) in 2016 and Hermosa (20) in 2017.4

Dataframe Loading

The com_eat DataFrame gathers the mediane of the HygieneScores of all the inspections for each Community Area per Year.

In [13]:
com_eat = pd.read_csv('Eat_Year_CommArea.csv',sep=',') #creation of the dataframe
In [14]:
com_eat['Community Area'] = com_eat['Community Area'].apply(str)
In [15]:
com_eat = com_eat.drop(0)
com_eat = com_eat.drop(columns='Unnamed: 0')
In [16]:
com_eat.head()
Out[16]:
Community Area 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
1 1.0 39.251572 53.355000 62.423645 43.438053 22.062201 56.381395 67.159236 25.429104 55.022831 118.766129
2 2.0 58.363636 62.550336 61.764493 57.908163 26.806202 20.509363 68.905556 58.531579 116.579114 80.261194
3 3.0 56.081448 65.475904 29.535971 35.849057 56.966759 34.114035 41.937870 67.795455 73.959732 89.592920
4 4.0 73.778351 42.085526 37.649746 35.472222 35.032609 32.761538 39.003534 54.254545 54.326733 116.350785
5 5.0 27.909574 45.666667 40.171271 40.937824 23.353488 39.789744 18.733624 22.330396 46.453333 86.061224
In [17]:
com_eat['Median'] = com_eat.drop(columns='Community Area').median(axis=1)
com_eat.head()
Out[17]:
Community Area 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 Median
1 1.0 39.251572 53.355000 62.423645 43.438053 22.062201 56.381395 67.159236 25.429104 55.022831 118.766129 54.188916
2 2.0 58.363636 62.550336 61.764493 57.908163 26.806202 20.509363 68.905556 58.531579 116.579114 80.261194 60.148036
3 3.0 56.081448 65.475904 29.535971 35.849057 56.966759 34.114035 41.937870 67.795455 73.959732 89.592920 56.524103
4 4.0 73.778351 42.085526 37.649746 35.472222 35.032609 32.761538 39.003534 54.254545 54.326733 116.350785 40.544530
5 5.0 27.909574 45.666667 40.171271 40.937824 23.353488 39.789744 18.733624 22.330396 46.453333 86.061224 39.980507

General Visualization

The following figure allows to visualize the HygieneScores per Year for each Community Area. It is possible to unselect the year we don't want to display clicking on its label on the right of the figure.

In [24]:
import plotly
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
In [25]:
commarea = com_eat['Community Area']

fig = go.Figure()
colorlist = ('indianred', 'lightsalmon', 'darkorchid', 'firebrick', 'cornsilk', 'saddlebrown', 'lavenderblush', 'chocolate')
c = 0
av = []
for x in range(2010, 2018) :
    x = str(x)
    fig.add_trace(go.Bar(x=commarea, y=com_eat[x], name=x, marker_color=colorlist[c]))
    c += 1
    
fig.add_trace(go.Scatter(x=commarea, y=com_eat['Median'], name='Median', marker_color='black'))
    
# Here we modify the tickangle of the xaxis, resulting in rotated labels.
fig.update_layout(barmode='group', xaxis_tickangle=-45)

fig.update_layout(title_text='Hygiene Score per Community Area')

fig.show()

plotly.offline.plot(fig, filename = 'HygieneScore_ComArea_Bars.html', auto_open=True)
Out[25]:
'HygieneScore_ComArea_Bars.html'

Comments :

The figure allows to see in details the HygieneScores of each Community Area. The Mean line helps to see the bars tending to deviate from the Median computed.

What we can see is that there is no particular trend, with a Median oscillating between about 30 and 60.

The entry which really is above the rest is the Community Area 47 with a Median at 4, except in 2013 it has low HygieneScores. According to http://www.encyclopedia.chicagohistory.org/pages/194.html, the 47th Community Area of Chicago, called Burnside, is situated on the border of the city and is a "comfortable residential community".

In [8]:
commarea = com_eat['Community Area']

fig = go.Figure()
colorlist = ('indianred', 'lightsalmon', 'darkorchid', 'firebrick', 'cornsilk', 'saddlebrown', 'lavenderblush', 'chocolate')
c = 0
for x in range(2010, 2018) :
    x = str(x)
    fig.add_trace(go.Scatter(x=commarea, y=com_eat[x], name=x, marker_color=colorlist[c]))
    c += 1
    
#fig.add_trace(go.Scatter(mode='markers', x=commarea, y=com_eat['Median'], name='Median', marker_color='black'))
    
# Here we modify the tickangle of the xaxis, resulting in rotated labels.
fig.update_layout(barmode='group', xaxis_tickangle=-45)

fig.update_layout(title_text='Hygiene Score per Community Area')

fig.show()

plotly.offline.plot(fig, filename = 'HygieneScore_ComArea_Scatter.html', auto_open=True)
Out[8]:
'HygieneScore_ComArea_Scatter.html'

Comments :

The figure has been computed to get another point of view of the HygieneScores. The same constatations are done than before, which confirm the no particular trend.

Visualization Year By Year

The following figure allows to visualize the results in the ascending order - from top to bottom : the top Community Areas have the best HygieneScore (the lowest).

In [9]:
com_eat2010 = com_eat.sort_values(by=['2010'], ascending=False)
In [10]:
commarea = com_eat2010['Community Area']
fig = go.Figure(go.Bar(x=com_eat2010['2010'], y=commarea, orientation='h'))

fig.update_layout(title_text='Hygiene Score per Community Area in 2010', barmode='stack',\
                  yaxis=dict(type='category', title='Community Area', tickmode='linear'), xaxis=dict(title='HygieneScore'))

fig.show()

plotly.offline.plot(fig, filename = 'HygieneScore_ComArea_2010.html', auto_open=True)
Out[10]:
'HygieneScore_ComArea_2010.html'
In [11]:
com_eat2011 = com_eat.sort_values(by=['2011'], ascending=False)
In [12]:
commarea = com_eat2011['Community Area']
fig = go.Figure(go.Bar(x=com_eat2011['2011'], y=commarea, orientation='h'))

fig.update_layout(title_text='Hygiene Score per Community Area in 2011', barmode='stack',\
                  yaxis=dict(type='category', title='Community Area', tickmode='linear'), xaxis=dict(title='HygieneScore'))

fig.show()

plotly.offline.plot(fig, filename = 'HygieneScore_ComArea_2011.html', auto_open=True)
Out[12]:
'HygieneScore_ComArea_2011.html'

Comments :

Again, the two figures here show that there is no particular trend. No need to display for each year : neither the top 10 nor the bottom 10 are alike, except for the 47th Community Area which stays at the top 1.

For the 47th Community Area, https://en.wikipedia.org/wiki/Burnside,_Chicago#cite_note-cmap-1 explains that it has known the White Flight phenomenon (https://en.wikipedia.org/wiki/White_flight), leading the businesses to move away. The low HygieneScores could either be explained by a smallest number of establishments, but also as said before because of the confort of the zone which could ease the maintenance of the establishments.

Another view

In [ ]:
commarea = com_eat['Community Area']

fig = go.Figure()
colorlist = ('firebrick', 'indianred', 'lightsalmon', 'darkorchid', 'cornsilk', 'saddlebrown', 'lavenderblush', 'chocolate')
c = 0
for x in range(2010, 2018) :
    x = str(x)
    fig.add_trace(go.Bar(x=com_eat[x], y=com_eat['Community Area'], orientation='h', name=x, marker_color=colorlist[c]))
    c += 1
    
fig.update_layout(title_text='Hygiene Score per Community Area', barmode='stack',\
                yaxis=dict(type='category', title='Community Area', tickmode='linear'), xaxis=dict(title='HygieneScore'))
    
fig.show()

plotly.offline.plot(fig, filename = 'HygieneScore_ComArea_2.html', auto_open=True)

Correlation

The corr function give the pearson coefficient between the HygieneScores and the Community Areas.

In [13]:
heat = com_eat.drop(columns='Median').corr()
heat
Out[13]:
2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
2010 1.000000 0.234352 0.017041 -0.080688 0.073732 0.157595 0.205825 0.255774 0.163786 -0.042699
2011 0.234352 1.000000 0.390623 0.016596 0.076792 0.191203 0.181480 0.359636 0.497488 0.139233
2012 0.017041 0.390623 1.000000 0.283086 0.141193 0.411255 0.173338 0.276176 0.391664 0.205324
2013 -0.080688 0.016596 0.283086 1.000000 -0.132036 0.015564 0.060249 -0.032524 0.134067 0.267134
2014 0.073732 0.076792 0.141193 -0.132036 1.000000 -0.106756 -0.135195 -0.023927 -0.086958 0.125506
2015 0.157595 0.191203 0.411255 0.015564 -0.106756 1.000000 0.382015 0.037763 0.329956 -0.034975
2016 0.205825 0.181480 0.173338 0.060249 -0.135195 0.382015 1.000000 -0.047673 0.337250 0.091819
2017 0.255774 0.359636 0.276176 -0.032524 -0.023927 0.037763 -0.047673 1.000000 0.419489 -0.076904
2018 0.163786 0.497488 0.391664 0.134067 -0.086958 0.329956 0.337250 0.419489 1.000000 -0.071025
2019 -0.042699 0.139233 0.205324 0.267134 0.125506 -0.034975 0.091819 -0.076904 -0.071025 1.000000
In [14]:
fig = go.Figure(data=go.Heatmap(x = heat.index, y = heat.index, z=heat))
layout = go.Layout(title = 'Pearson Coeficient HygieneScore - Community Area')
fig.show()

plotly.offline.plot(fig, filename = 'Corr_eat.html', auto_open=True)
Out[14]:
'Corr_eat.html'

Conclusion :

Considering the constant variations in the food domain, the results obtained could simply indicated that the inspections are fair, following the phenomenon of changes, unpredictable due to the behaviors in the restauration industry.


Dataframe Loading

The types DataFrame gathers the result of the inspections for each FacilityGroup.

In [15]:
types = pd.read_csv('Eat_Type_Year.csv',sep=',') #creation of the dataframe
In [16]:
types = types.drop(0)
types = types.drop(columns='Unnamed: 0')

To begin, the statut of the establishment (Private or Public) is added to the DataFrame so that the difference between the two could be well visualized.

In [17]:
def sep(row):
    if row['FacilityGroup'] in ['bakery_restaurant', 'grocery_restaurant', 'bar_restaurant', 'banquet',\
                               'catering', 'restaurant', 'golden_diner', 'rooftop_restaurant', 'liquor_restaurant'] :
        return 'Public'
    if row['FacilityGroup'] in ['adultcare', 'childrens_services', 'daycare', 'school'] :
        return 'Private'
In [18]:
types['Stat'] = types.apply (lambda row: sep(row), axis=1)
types
Out[18]:
FacilityGroup 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 Stat
1 adultcare 33.910798 39.953846 35.508380 38.783505 40.306667 33.133047 38.318182 41.488000 73.512605 62.645933 Private
2 bakery_restaurant 5.000000 24.000000 64.000000 81.500000 8.000000 NaN NaN NaN NaN NaN Public
3 banquet 4.000000 81.937500 63.000000 50.793103 42.250000 18.608696 14.931034 27.533333 163.368421 90.312500 Public
4 bar_restaurant 76.250000 4.500000 45.555556 64.000000 69.500000 69.166667 48.625000 27.500000 91.333333 24.500000 Public
5 catering 42.032258 32.092784 38.716216 24.678161 24.304762 37.810345 32.943548 32.922330 44.378049 65.550562 Public
6 daycare 34.079172 31.232648 35.620979 29.519380 34.801170 19.871795 23.569061 25.119181 43.782609 46.995305 Private
7 golden_diner 30.386364 43.945946 9.621622 25.774194 15.529412 24.622951 24.681159 28.096774 45.967213 37.142857 Public
8 grocery_restaurant 50.058824 33.687500 59.625000 45.916667 19.071429 15.875000 57.312500 81.142857 109.000000 110.444444 Public
9 liquor_restaurant NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN NaN Public
10 restaurant 40.378860 41.392556 40.047406 35.031800 36.985640 41.157266 40.921986 41.441261 60.265968 90.780985 Public
11 rooftop_restaurant 0.000000 55.250000 19.000000 2.000000 14.000000 1.333333 1.500000 NaN NaN NaN Public
12 school 23.309795 28.157504 32.976542 24.825022 28.070473 30.290861 31.385321 30.081456 41.584048 40.268215 Private

Then two DataFrames private and public are created slicing the types DataFrame. This way, they can be put on a same figure but well separated.

In [19]:
private = types.loc[types['Stat'] == 'Private']
private = pd.DataFrame(private)
private['Median'] = private.median(axis=1)

public = types.loc[types['Stat'] == 'Public']
public = pd.DataFrame(public)
public['Median'] = public.median(axis=1)
In [20]:
fig = go.Figure()
colorlist1 = ('indianred', 'lightsalmon', 'darkorchid', 'firebrick', 'cornsilk', 'saddlebrown', 'lavenderblush', 'chocolate', 'lightgoldenrodyellow', 'lightgray')
colorlist2 = ('lightgrey', 'lightgreen', 'magenta', 'maroon', 'mediumaquamarine', 'mediumorchid', 'mediumpurple', 'purple', 'red', 'rosybrown')
c = 0
for x in range(2010, 2020) :
    x = str(x)
    fig.add_trace(go.Bar(x=private.FacilityGroup, y=private[x], name='Private'+x, marker_color=colorlist1[c]))
    fig.add_trace(go.Bar(x=public.FacilityGroup, y=public[x], name='Public'+x, marker_color=colorlist2[c]))
    c += 1

    # Here we modify the tickangle of the xaxis, resulting in rotated labels.
    fig.update_layout(barmode='group', xaxis_tickangle=-45)

    fig.update_layout(title_text='HygieneScore per Facility Type', xaxis=dict(title='Facility Type'), yaxis=dict(title='HygieneScore'))

fig.add_trace(go.Scatter(x=private.FacilityGroup, y=private['Median'], name='Median', marker_color='black', mode='lines'))
fig.add_trace(go.Scatter(x=public.FacilityGroup, y=public['Median'], name='Median', marker_color='black', mode='lines'))
fig.show()

plotly.offline.plot(fig, filename = 'HygieneScore_FacilityType.html', auto_open=True)
Out[20]:
'HygieneScore_FacilityType.html'

Comments :

On this figure, we can study the differences between Public and Private establishments.

The same no particular trend can be observed for the Public establishments, except for the Restaurant type of facility which happens to have no important variations until 2018 where, then for 2019, the rise of the HygieneScores is huge. The Restaurant type of facility is the one with the more entries in the dataset.

For the Private establishments, we can say that the HygieneScores are more stable. They seems to follow the same trend, with a rise in 2018 and 2019.

Conclusion :

The Private establishements are the more sensitive. Because of the way they works - children, elderly are in their care and are often expensive - they generally have to follow specific rules. Their particular duty could explain the fact that their results are more constant than the ones of the Public establishments.


Dataframe Loading

The business DataFrame gathers the Owner Name, Number of Restaurants by Owner and HygieneScores .

In [38]:
business = pd.read_csv('license_owner_score_number.csv',sep=',') #creation of the dataframe
In [39]:
business.head()
Out[39]:
full name InspectionScore N Restaurants
0 A MOHIT 0.000000 1
1 AAMIR DAWOOD 40.666667 3
2 AAMIR HUSSAIN 40.000000 1
3 AAMIR KHAN 2.000000 1
4 AARON BUSH 109.000000 1
In [40]:
business = business[['InspectionScore','N Restaurants']].copy().groupby('N Restaurants').mean()
In [41]:
business = business.reset_index()

Correlation

The corr function give the pearson coefficient between the HygieneScores and the Number of Restaurants by Owner.

In [42]:
business['InspectionScore'].corr(business['N Restaurants'])
Out[42]:
-0.29536601012489433
In [43]:
business.corr()
Out[43]:
N Restaurants InspectionScore
N Restaurants 1.000000 -0.295366
InspectionScore -0.295366 1.000000

Conclusion :

The HygieneScores and the Number of Restaurants by Owner are not correlated following the Pearson method.

The result obtained is contrary to our thinking : we would have thought that the more establishments a owner has, the more able he is to enforce rules fitting the Food Code - owning several establishements would induce more experiment and resources. Apparently, this is not the case !


Crime scores per community area

We do exactly the same than before with the hygiene scores, with the dataframe that contains the crime score per community area per year.

In [26]:
#opening the dataframe saved in the precedent part
crime_year_commareas = pd.read_csv('Crime_Year_CommArea.csv',sep=',')
crime_year_commareas.head(3)
Out[26]:
Unnamed: 0 Community Area 2010 2011 2012 2013 2014 2015 2016 2017
0 0 0.0 0.3 0.5 4.4 4.1 1.1 1.1 NaN NaN
1 1 1.0 3572.3 3329.1 2874.9 2447.0 2262.3 2196.0 2380.6 86.0
2 2 2.0 3975.1 3584.9 3170.7 2526.7 2440.4 2238.1 2497.2 158.2
In [27]:
#getting rid of decimal in Community Area and turning them into string type
crime_year_commareas['Community Area'] = crime_year_commareas['Community Area'].astype('int')
crime_year_commareas['Community Area'] = crime_year_commareas['Community Area'].astype('str')
crime_year_commareas.head(3)
Out[27]:
Unnamed: 0 Community Area 2010 2011 2012 2013 2014 2015 2016 2017
0 0 0 0.3 0.5 4.4 4.1 1.1 1.1 NaN NaN
1 1 1 3572.3 3329.1 2874.9 2447.0 2262.3 2196.0 2380.6 86.0
2 2 2 3975.1 3584.9 3170.7 2526.7 2440.4 2238.1 2497.2 158.2
In [28]:
#creation of the map centered on Chicago city
crime_map_2010 = folium.Map(location=[41.85, -87.55],
                 tiles='Stamen Terrain',
                 zoom_start=9.5,
                 )

#adding to the map the boundaries and the crime scores of 2010
folium.Choropleth(name='Crime scores in 2010',
                  geo_data = comareas_geo, 
                  data = crime_year_commareas,
                  columns = ['Community Area', '2010'],
                  key_on = 'feature.properties.area_num_1',
                  fill_color = 'YlOrRd',
                  fill_opacity = 0.7,
                  line_opacity = 0.2,
                  threshold_scale = [0, 2000, 4000, 6000, 8000, 10000, 12000, 14000, 16000],
                  legend_name = 'Crime score per community area in 2010'
                 ).add_to(crime_map_2010)

crime_map_2010.save('crime_map_2010.html')
crime_map_2010
Out[28]:
In [29]:
#same as above for 2011

crime_map_2011 = folium.Map(location=[41.85, -87.55],
                 tiles='Stamen Terrain',
                 zoom_start=9.5,
                 )

folium.Choropleth(name='Crime scores in 2011',
                  geo_data = comareas_geo, 
                  data = crime_year_commareas,
                  columns = ['Community Area', '2011'],
                  key_on = 'feature.properties.area_num_1',
                  fill_color = 'YlOrRd',
                  fill_opacity = 0.7,
                  line_opacity = 0.2,
                  threshold_scale = [0, 2000, 4000, 6000, 8000, 10000, 12000, 14000, 16000],
                  legend_name = 'Crime score per community area in 2011'
                 ).add_to(crime_map_2011)

crime_map_2011.save('crime_map_2011.html')
crime_map_2011
Out[29]:
In [30]:
#same as above for 2012

crime_map_2012 = folium.Map(location=[41.85, -87.55],
                 tiles='Stamen Terrain',
                 zoom_start=9.5,
                 )

folium.Choropleth(name='Crime scores in 2012',
                  geo_data = comareas_geo, 
                  data = crime_year_commareas,
                  columns = ['Community Area', '2012'],
                  key_on = 'feature.properties.area_num_1',
                  fill_color = 'YlOrRd',
                  fill_opacity = 0.7,
                  line_opacity = 0.2,
                  threshold_scale = [0, 2000, 4000, 6000, 8000, 10000, 12000, 14000, 16000],
                  legend_name = 'Crime score per community area in 2012'
                 ).add_to(crime_map_2012)

crime_map_2012.save('crime_map_2012.html')
crime_map_2012
Out[30]:
In [31]:
#same as above for 2013

crime_map_2013 = folium.Map(location=[41.85, -87.55],
                 tiles='Stamen Terrain',
                 zoom_start=9.5,
                 )

folium.Choropleth(name='Crime scores in 2013',
                  geo_data = comareas_geo, 
                  data = crime_year_commareas,
                  columns = ['Community Area', '2013'],
                  key_on = 'feature.properties.area_num_1',
                  fill_color = 'YlOrRd',
                  fill_opacity = 0.7,
                  line_opacity = 0.2,
                  threshold_scale = [0, 2000, 4000, 6000, 8000, 10000, 12000, 14000, 16000],
                  legend_name = 'Crime score per community area in 2013'
                 ).add_to(crime_map_2013)

crime_map_2013.save('crime_map_2013.html')
crime_map_2013
Out[31]:
In [32]:
#same as above for 2014

crime_map_2014 = folium.Map(location=[41.85, -87.55],
                 tiles='Stamen Terrain',
                 zoom_start=9.5,
                 )

folium.Choropleth(name='Crime scores in 2014',
                  geo_data = comareas_geo, 
                  data = crime_year_commareas,
                  columns = ['Community Area', '2014'],
                  key_on = 'feature.properties.area_num_1',
                  fill_color = 'YlOrRd',
                  fill_opacity = 0.7,
                  line_opacity = 0.2,
                  threshold_scale = [0, 2000, 4000, 6000, 8000, 10000, 12000, 14000, 16000],
                  legend_name = 'Crime score per community area in 2014'
                 ).add_to(crime_map_2014)

crime_map_2014.save('crime_map_2014.html')
crime_map_2014
Out[32]:
In [33]:
#same as above for 2015

crime_map_2015 = folium.Map(location=[41.85, -87.55],
                 tiles='Stamen Terrain',
                 zoom_start=9.5,
                 )

folium.Choropleth(name='Crime scores in 2015',
                  geo_data = comareas_geo, 
                  data = crime_year_commareas,
                  columns = ['Community Area', '2015'],
                  key_on = 'feature.properties.area_num_1',
                  fill_color = 'YlOrRd',
                  fill_opacity = 0.7,
                  line_opacity = 0.2,
                  threshold_scale = [0, 2000, 4000, 6000, 8000, 10000, 12000, 14000, 16000],
                  legend_name = 'Crime score per community area in 2015'
                 ).add_to(crime_map_2015)

crime_map_2015.save('crime_map_2015.html')
crime_map_2015
Out[33]:
In [34]:
#same as above for 2016

crime_map_2016 = folium.Map(location=[41.85, -87.55],
                 tiles='Stamen Terrain',
                 zoom_start=9.5,
                 )

folium.Choropleth(name='Crime scores in 2016',
                  geo_data = comareas_geo, 
                  data = crime_year_commareas,
                  columns = ['Community Area', '2016'],
                  key_on = 'feature.properties.area_num_1',
                  fill_color = 'YlOrRd',
                  fill_opacity = 0.7,
                  line_opacity = 0.2,
                  threshold_scale = [0, 2000, 4000, 6000, 8000, 10000, 12000, 14000, 16000],
                  legend_name = 'Crime score per community area in 2016'
                 ).add_to(crime_map_2016)

crime_map_2016.save('crime_map_2016.html')
crime_map_2016
Out[34]:
In [35]:
#same as above for 2017

crime_map_2017 = folium.Map(location=[41.85, -87.55],
                 tiles='Stamen Terrain',
                 zoom_start=9.5,
                 )

folium.Choropleth(name='Crime scores in 2017',
                  geo_data = comareas_geo, 
                  data = crime_year_commareas,
                  columns = ['Community Area', '2017'],
                  key_on = 'feature.properties.area_num_1',
                  fill_color = 'YlOrRd',
                  fill_opacity = 0.7,
                  line_opacity = 0.2,
                  threshold_scale = [0, 2000, 4000, 6000, 8000, 10000, 12000, 14000, 16000],
                  legend_name = 'Crime score per community area in 2017'
                 ).add_to(crime_map_2017)

crime_map_2017.save('crime_map_2017.html')
crime_map_2017
Out[35]:

Comments :

As we can see from the maps, the crime score stays more or less stable with a slight decrease from 2010 to 2015. In 2017, we can observe very low crime scores everywhere. As we read nothing that could explain this, we think it is an issue in the dataset, maybe a non-complet dataset for this year.

We can also see that one community area particularly stands out every year (except 2017) : Austin. Getting information on this community area we find it is one of the most populated community area and that there is in Austin a lot of violent crimes. That explains the high crime score that it gets as crimes scores are calculated based on the penalty of the crimes.

Dataframe Loading

The crime DataFrame gathers the mediane of the CrimeScores of all the inspections for each Community Area per Year.

In [21]:
crime = pd.read_csv('Crime_Year_CommArea.csv',sep=',') #creation of the dataframe
In [22]:
crime = crime.drop(0)
crime = crime.drop(columns='Unnamed: 0')
In [23]:
crime.head()
Out[23]:
Community Area 2010 2011 2012 2013 2014 2015 2016 2017
1 1.0 3572.3 3329.1 2874.9 2447.0 2262.3 2196.0 2380.6 86.0
2 2.0 3975.1 3584.9 3170.7 2526.7 2440.4 2238.1 2497.2 158.2
3 3.0 2884.2 2849.6 3032.2 2244.6 1917.7 2222.5 2110.1 120.2
4 4.0 1851.2 1679.5 1767.3 1529.3 1173.1 1037.9 1274.5 54.1
5 5.0 1592.9 1701.6 1664.4 1379.3 1145.8 958.6 937.3 30.7
In [24]:
crime['Community Area'] = crime['Community Area'].apply(str)

General Visualization

The following figure allows to visualize the CrimeScores per Year for each Community Area. It is possible to unselect the year we don't want to display clicking on its label on the right of the figure.

In [25]:
commarea = crime['Community Area']

fig = go.Figure()
colorlist = ('indianred', 'lightsalmon', 'darkorchid', 'firebrick', 'cornsilk', 'saddlebrown', 'lavenderblush', 'chocolate')
c = 0
for x in range(2010, 2018) :
    x = str(x)
    fig.add_trace(go.Bar(x=commarea, y=crime[x], name=x, marker_color=colorlist[c]))
    c += 1
    
# Here we modify the tickangle of the xaxis, resulting in rotated labels.
fig.update_layout(barmode='group', xaxis_tickangle=-45)

fig.update_layout(title_text='Crime Score per Community Area')

fig.show()

plotly.offline.plot(fig, filename = 'CrimeScore_ComArea_Bars.html', auto_open=True)
Out[25]:
'CrimeScore_ComArea_Bars.html'

Comments :

This figure allows to see in details the CrimeScores of each Community Area.

What we can already see is that the CrimeScore are very different inter-Community Areas, but stay in the same ranges intra-Community Areas. Mostly, the CrimeScores have decreased between 2010 and 2015, then have rised in 2016. This rise cannot be explain by our analysis of the Chicago Food Inspections, but it already has been discussed. Quickly searching on Google, many articles relate this trend :

"The city's overall crime rate, especially the violent crime rate, is higher than the US average. Chicago was responsible for nearly half of 2016's increase in homicides in the US, though the nation's crime rates remain near historic lows. The reasons for the higher numbers in Chicago remain unclear." - https://en.wikipedia.org/wiki/Crime_in_Chicago

Apparently, their analysis of the Chicago Food Inspections could not have explained it either.

The second observation is that the numbers for 2017 are very low. The high difference with the rest of the data is not normal, so that we think that there is a problem with the file used.

In [26]:
commarea = crime['Community Area']

fig = go.Figure()
colorlist = ('indianred', 'lightsalmon', 'darkorchid', 'firebrick', 'cornsilk', 'saddlebrown', 'lavenderblush', 'chocolate')
c = 0
for x in range(2010, 2018) :
    x = str(x)
    fig.add_trace(go.Scatter(x=commarea, y=crime[x], name=x, marker_color=colorlist[c]))
    c += 1
    
# Here we modify the tickangle of the xaxis, resulting in rotated labels.
fig.update_layout(barmode='group', xaxis_tickangle=-45)

fig.update_layout(title_text='Crime Score per Community Area')

fig.show()

plotly.offline.plot(fig, filename = 'CrimeScore_ComArea_Scatter.html', auto_open=True)
Out[26]:
'CrimeScore_ComArea_Scatter.html'

Comments :

The point of view on this figure allows to well visualized the fact that there is a general trend in which the CrimeScores varie linearly by CommunityArea.

Visualization Year By Year

The following figure allows to visualize the results in the descending order : the top Community Areas have the best CrimeScore (the lowest).

In [27]:
crime2010 = crime.sort_values(by=['2010'], ascending=False)
In [28]:
commarea = crime2010['Community Area']
fig = go.Figure(go.Bar(x=crime2010['2010'], y=commarea, orientation='h'))

fig.update_layout(title_text='Crime Score per Community Area in 2010', barmode='stack',\
                  yaxis=dict(type='category', title='Community Area', tickmode='linear'), xaxis=dict(title='CrimeScore'))

fig.show()

plotly.offline.plot(fig, filename = 'CrimeScore_ComArea_2010.html', auto_open=True)
Out[28]:
'CrimeScore_ComArea_2010.html'
In [29]:
crime2011 = crime.sort_values(by=['2011'], ascending=False)
In [30]:
commarea = crime2011['Community Area']
fig = go.Figure(go.Bar(x=crime2011['2011'], y=commarea, orientation='h'))

fig.update_layout(title_text='Crime Score per Community Area in 2011', barmode='stack',\
                  yaxis=dict(type='category', title='Community Area', tickmode='linear'), xaxis=dict(title='CrimeScore'))

fig.show()

plotly.offline.plot(fig, filename = 'CrimeScore_ComArea_2011.html', auto_open=True)
Out[30]:
'CrimeScore_ComArea_2011.html'

Comments :

This time, the two figures are showing the same trends. The same Community Areas are found, not exactly in the same order but still.

Correlation

The corr function give the pearson coefficient between the CrimeScores and the Community Areas.

In [31]:
heat = crime.drop(columns = '2017').corr()
heat
Out[31]:
2010 2011 2012 2013 2014 2015 2016
2010 1.000000 0.995976 0.988682 0.987405 0.988849 0.981853 0.974403
2011 0.995976 1.000000 0.993438 0.992617 0.990226 0.984236 0.973757
2012 0.988682 0.993438 1.000000 0.994380 0.991441 0.986420 0.978517
2013 0.987405 0.992617 0.994380 1.000000 0.990561 0.984646 0.975407
2014 0.988849 0.990226 0.991441 0.990561 1.000000 0.994928 0.990790
2015 0.981853 0.984236 0.986420 0.984646 0.994928 1.000000 0.991769
2016 0.974403 0.973757 0.978517 0.975407 0.990790 0.991769 1.000000
In [32]:
fig = go.Figure(data=go.Heatmap(x = heat.index, y = heat.index, z=heat))
layout = go.Layout(width = 700, height = 700, autosize = False , title = 'Pearson Coeficient HygieneScore - Community Area')
fig.show()

plotly.offline.plot(fig, filename = 'Corr_crime.html', auto_open=True)
Out[32]:
'Corr_crime.html'

Conclusion :

The CrimeScore really is related to the place.


Hygiene VS Crime

Visualization using a Double Y axis

In [33]:
for n in range(2010, 2018) :
    n = str(n)
    
    # Create figure with secondary y-axis
    fig = make_subplots(specs=[[{"secondary_y": True}]])

    # Add traces
    fig.add_trace(go.Scatter(x=crime['Community Area'], y=crime[n], name="CRIME" + n), secondary_y=False,)

    fig.add_trace(go.Scatter(x=com_eat['Community Area'], y=com_eat[n], name="HYGIENE" + n), secondary_y=True,)

    # Add figure title
    fig.update_layout(title_text="Crime and Hygiene Scores " + n)

    # Set x-axis title
    fig.update_xaxes(title_text="Community Area")

    # Set y-axes titles
    fig.update_yaxes(title_text="<b>CrimeScore</b>", secondary_y=False)
    fig.update_yaxes(title_text="<b>HygieneScore</b>", secondary_y=True)

    fig.show()
    
    plotly.offline.plot(fig, filename = 'Double_' + n + '.html', auto_open=True)

Comments :

It is hard to say anything relevant yet. Another calculation could be useful.

Dataframe Loading

The eat_crime DataFrame gathers the CrimeScores and HygieneScores of all the inspections for each Community Area.

In [34]:
eat_crime = pd.read_csv('Crime_Eat_CommArea.csv',sep=',') #creation of the dataframe
In [35]:
eat_crime = eat_crime.drop(0)
In [36]:
eat_crime.head()
Out[36]:
Community Area InspectionScore Crime Score
1 1.0 55.524547 26844.0
2 2.0 62.394390 28176.7
3 3.0 54.139052 23522.4
4 4.0 50.462842 14084.0
5 5.0 37.824034 13099.8

Correlation

The corr function give the pearson coefficient between the HygieneScores and the CrimeScores.

In [37]:
eat_crime['InspectionScore'].corr(eat_crime['Crime Score'])
Out[37]:
0.3059226707490215

Conclusion :

The CrimeScore and the HygieneScore are not correlated following the Pearson method.